Text-to-speech (TTS) voice generation
POST /api/v1/creative-hub/generate/tts accepts elevenlabs / openai_tts. Body: text + voice_id + language + provider + drive_folder_id (no settings field). Audio output for compositing.
Written By Salvatore Sinigaglia
Last updated About 4 hours ago
POST /api/v1/creative-hub/generate/tts (verified in apps/backend/src/routes/api/creative-hub-generate.route.ts). This route accepts elevenlabs (high quality, multilingual, voice cloning) and openai_tts (good baseline, faster + cheaper). (The provider catalog in apps/backend/src/providers/creative/types.ts also lists heygen as a TTS provider, used from the studio suite.) Body: text, voice_id, language, provider, drive_folder_id. There is no settings field. Async via the TTS worker. Output: an audio file in your Drive folder.
Who is this for
Media buyers needing voice-over for video ads, multilingual narration, or accessibility audio. The audio is most often used as input to compositing (ch-117).
The 2 providers
elevenlabs
Best for: premium voice quality, multilingual, voice cloning.
Strengths:
- Highest fidelity (sounds most human)
- Excellent multilingual support
- Voice cloning available (record your spokesperson's voice → clone → generate any text in their voice)
- Emotional tone control
Weaknesses:
- More expensive
- Slower per character
openai_tts
Best for: fast baseline TTS, cost-conscious workflows, English-first content.
Strengths:
- Fast generation
- Lower cost
- Stable quality for English
Weaknesses:
- Less natural than ElevenLabs in side-by-side
- Limited multilingual range
How to generate
Step 1: Open the generator
/creative-hub → AI Generate → TTS tab.
Step 2: Pick a provider
Dropdown. Default may be openai_tts or workspace-configured.
Step 3: Write the text
Text field. The narration script.
Best practices for TTS-friendly text:
- Short sentences (under 20 words)
- Natural punctuation (commas + periods drive pauses)
- Avoid all-caps (some providers shout it)
- Phonetic spelling for tricky names ("acme-AY-mee" if pronunciation differs from spelling)
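The first three guidelines above can be checked mechanically before you submit a script. The following is a minimal sketch, not part of Wevion: lint_script is a hypothetical helper, the sentence split is naive, and the four-letter threshold for all-caps words is an assumption chosen so that short acronyms and phonetic hints like "AY" pass.

```python
import re

MAX_WORDS = 20  # guideline from the list above: keep sentences under 20 words


def lint_script(text: str) -> list[str]:
    """Return warnings for sentences that are long or contain all-caps words."""
    warnings = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sentence in sentences:
        word_count = len(sentence.split())
        if word_count >= MAX_WORDS:
            warnings.append(f"long sentence ({word_count} words): {sentence[:40]}")
        # Words of four or more capital letters; short acronyms are let through.
        caps = re.findall(r"\b[A-Z]{4,}\b", sentence)
        if caps:
            warnings.append(f"all-caps word(s): {', '.join(caps)}")
    return warnings
```

An empty list means the script passes these checks. Pronunciation still has to be judged by ear.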
Step 4: Pick voice_id
Each provider has its own voice library:
- ElevenLabs: dozens of stock voices + cloned voices (voice_id is their identifier)
- OpenAI TTS: handful of named voices (alloy, echo, fable, onyx, nova, shimmer)
Per-locale: pick a voice native to the target language for natural accent.
Step 5: Set language
Language code (en, it, es, fr, de, etc.). Affects pronunciation rules + intonation.
Step 6: No fine-grained settings (voice + language only)
The /generate/tts route takes text, voice_id, language, and provider — it does not accept a fine-grained settings object (stability, similarity, speed). Pick the voice and language that best fit; for finer voice controls use the provider's own tooling or the studio suite.
Step 7: Submit
Click Generate. Returns 202 Accepted + job_id.
Step 8: Track + download
TTS is fast: 5-30 sec typical.
Once completed: audio file (MP3) in your Drive folder. Use directly OR pair with video via compositing.
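If you script the tracking step instead of watching the UI, it comes down to polling the job status route listed under Endpoint (GET /api/v1/creative-hub/generate/jobs/:id). This is a minimal sketch under stated assumptions: the response is taken to be JSON with a status field that becomes completed, the failed value is a guess, and get_status is a hypothetical callable you supply to perform the HTTP GET with your own authentication.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get_status(job_id) until the job finishes or timeout_s elapses."""
    waited = 0.0
    while True:
        job = get_status(job_id)
        # "completed" is named in this article; "failed" is an assumed value.
        if job.get("status") in ("completed", "failed"):
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still pending after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

With typical generation times of 5-30 seconds, a 2-second interval keeps the number of requests small. The sleep parameter is injectable so that the loop can be tested without real delays.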
Endpoint
POST /api/v1/creative-hub/generate/tts (verified).
Body:
- text (required, 1-5000 chars)
- provider (optional, elevenlabs | openai_tts)
- voice_id (optional)
- language (optional, e.g. en, it)
- drive_folder_id (optional)
There is no settings field on this route. Returns 202 + job_id. Worker calls the upstream provider, downloads the audio, stores it in Drive, and marks the job completed. Poll GET /api/v1/creative-hub/generate/jobs/:id.
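A client can enforce the documented body rules before making the call. The sketch below is illustrative only: build_tts_body and to_request are hypothetical helpers, the base URL is a placeholder, the Bearer token header is an assumption (this article does not describe authentication), and measuring the 1-5000 limit with len() assumes the limit counts characters as Python does.

```python
import json
import urllib.request

PROVIDERS = {"elevenlabs", "openai_tts"}
OPTIONAL_FIELDS = {"provider", "voice_id", "language", "drive_folder_id"}
TTS_PATH = "/api/v1/creative-hub/generate/tts"


def build_tts_body(text: str, **optional: str) -> dict:
    """Validate fields against the documented body and return the JSON payload."""
    unknown = set(optional) - OPTIONAL_FIELDS
    if unknown:
        # Catches "settings" and anything else the route does not accept.
        raise ValueError(f"unsupported field(s): {sorted(unknown)}")
    if not 1 <= len(text) <= 5000:
        raise ValueError("text must be 1-5000 characters")
    provider = optional.get("provider")
    if provider is not None and provider not in PROVIDERS:
        raise ValueError(f"provider must be one of {sorted(PROVIDERS)}")
    return {"text": text, **optional}


def to_request(base_url: str, token: str, body: dict) -> urllib.request.Request:
    """Wrap the payload in a POST request object (not sent here)."""
    return urllib.request.Request(
        base_url + TTS_PATH,
        data=json.dumps(body).encode("utf-8"),
        # Assumed auth scheme; replace with whatever your workspace uses.
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )
```

Passing the request to urllib.request.urlopen would send it. Per the route description, a successful call returns 202 with a job_id to poll.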
Cost
TTS cost is generally low (charged per character or per token). ElevenLabs costs more than OpenAI TTS.
For long-form content (1 min+ narration): cost adds up — consider OpenAI TTS for iteration, ElevenLabs for final.
See ch-112 AI credits.
Multilingual workflow
Same script translated → same voice_id (or per-locale matched voice) → N audio variants for N languages.
Common pattern:
- Write English script
- Translate via external tool (or human)
- Generate TTS in each language with appropriate voice_id + language
- Pair each audio with video (often the same video, multiple audio tracks via compositing)
Result: localized ads from a single video asset.
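The pattern above amounts to producing one request body per language. A minimal sketch, assuming you keep translated scripts and per-locale voices in two dictionaries keyed by language code; localized_bodies is a hypothetical helper and the voice IDs in the usage note are placeholders, not real identifiers.

```python
def localized_bodies(
    scripts: dict[str, str],
    voices: dict[str, str],
    provider: str = "elevenlabs",
) -> list[dict]:
    """Build one TTS request body per language from translated scripts."""
    bodies = []
    for language, text in scripts.items():
        if language not in voices:
            # Fail early rather than fall back to a voice with the wrong accent.
            raise KeyError(f"no voice configured for language '{language}'")
        bodies.append(
            {
                "text": text,
                "provider": provider,
                "voice_id": voices[language],
                "language": language,
            }
        )
    return bodies
```

For example, localized_bodies({"en": "Try it today.", "it": "Provalo oggi."}, {"en": "voice-en-placeholder", "it": "voice-it-placeholder"}) yields two bodies, each ready to submit to the TTS route as a separate job.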
Voice cloning (ElevenLabs)
For a custom voice (your spokesperson, brand voice):
- Upload voice samples to ElevenLabs UI (their consent + capture flow)
- ElevenLabs creates a cloned voice
- Reference the new voice_id in Wevion
Allows scaling spokesperson voice across hundreds of ad variants without recording each one.
Best practices
Test short before long
Generate 1-2 sentences to validate voice + pronunciation. Then generate full script.
Use punctuation deliberately
Commas = short pause. Periods = full pause. Em-dashes — like this — = thoughtful pause. TTS respects these.
Match voice to ad mood
- Energetic ads: faster voice with enthusiasm
- Trust / authority ads: deeper voice with measured pace
- Casual / friendly: lighter voice, conversational
Pair with compositing immediately
Audio alone has limited use. Pair with video via ch-117 compositing for ad-ready output.
Common mistakes
- Generating long monologue: cut into shorter sections for better pacing + easier iteration
- Wrong language for voice: a voice trained on English produces a strange accent when speaking Italian. Use a locale-matched voice.
- Treating TTS like talking-head video: TTS is audio only; you still need visuals (video or static image carousel)
- Skipping iteration: first TTS generation rarely perfect; iterate on punctuation + pacing
Common issues
- Mispronunciation of brand name: use phonetic spelling in the input
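- Tone too monotone: switch to a more expressive voice on either provider. This route has no settings field, so controls such as ElevenLabs style exaggeration must be adjusted in the provider's own tooling or the studio suite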
- Audio cuts off mid-sentence: text length may have hit provider limit; split into multiple shorter generations
- Languages mixed mid-script: providers handle code-switching poorly; keep single-language scripts
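For the cut-off case, splitting at sentence boundaries keeps each generation under the limit without breaking a sentence in half. The sketch below assumes the 5000-character limit documented for the text field; upstream provider limits are not stated in this article and may be lower, so max_chars is adjustable. split_script is a hypothetical helper.

```python
import re


def split_script(text: str, max_chars: int = 5000) -> list[str]:
    """Split a script into chunks of at most max_chars, breaking between sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            raise ValueError("a single sentence exceeds max_chars; shorten it")
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk becomes its own TTS job. The resulting audio files then need to be joined or placed in sequence during compositing.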
FAQ
Which TTS provider should I use in Wevion?
Wevion's Creative Hub offers two text-to-speech providers. elevenlabs delivers the highest fidelity, excellent multilingual support, voice cloning, and emotional tone control, but costs more and runs slower. openai_tts is faster and cheaper with stable English quality. A common pattern is using OpenAI TTS for iteration and ElevenLabs for the final voiceover.
Can I clone a voice in Wevion?
Yes, through ElevenLabs. Upload voice samples to the ElevenLabs UI using their consent and capture flow, let ElevenLabs create the cloned voice, then reference that new voice_id in Wevion. This lets you scale your spokesperson or brand voice across hundreds of ad variants without recording each line individually.
How do I generate multilingual voiceover?
Translate your script, then generate TTS in each language using an appropriate voice_id and the matching language code. In Wevion, picking a voice native to the target language gives a natural accent. You can then pair each audio track with the same video through compositing, producing localized ads from a single video asset.
How do I make TTS sound natural?
Write TTS-friendly text: keep sentences under about 20 words, use natural punctuation since commas and periods drive pauses, avoid all-caps because some providers shout it, and use phonetic spelling for tricky names. Test one or two sentences first to validate the voice and pronunciation before generating the full script, and iterate on punctuation and pacing.