Text-to-speech (TTS) voice generation
Last updated: May 19, 2026
TTS generation runs through POST /api/v1/creative-hub/generate/tts (verified in apps/backend/src/routes/api/creative-hub-generate.route.ts). Two providers are verified in apps/backend/src/providers/creative/types.ts: elevenlabs (high quality, multilingual, voice cloning) and openai_tts (a good baseline that is faster and cheaper). Parameters are text, voice_id, language, plus provider-specific settings. Generation is asynchronous and handled by generate-tts.worker.ts. The output is an audio file (typically MP3) saved to your Drive folder.
Who is this for
Media buyers who need voice-over for video ads, multilingual narration, or accessibility audio. TTS output is especially common as input to compositing (ch-117).
The 2 providers
elevenlabs
Best for: premium voice quality, multilingual, voice cloning.
Strengths:
Highest fidelity (sounds most human)
Excellent multilingual support
Voice cloning available (record your spokesperson's voice → clone → generate any text in their voice)
Emotional tone control
Weaknesses:
More expensive
Slower per character
openai_tts
Best for: fast baseline TTS, cost-conscious workflows, English-first content.
Strengths:
Fast generation
Lower cost
Stable quality for English
Weaknesses:
Less natural than ElevenLabs in side-by-side
Limited multilingual range
How to generate
Step 1: Open the generator
/creative-hub → AI Generate → TTS tab.
Step 2: Pick a provider
Choose a provider from the dropdown. The default may be openai_tts or a workspace-configured provider.
Step 3: Write the text
Enter the narration script in the text field.
Best practices for TTS-friendly text:
Short sentences (under 20 words)
Natural punctuation (commas + periods drive pauses)
Avoid all-caps (some providers shout it)
Phonetic spelling for tricky names ("acme-AY-mee" if pronunciation differs from spelling)
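The text guidelines above can be checked automatically before submitting. The following is a minimal sketch, not part of the product: `lintTtsScript` and `ScriptWarning` are hypothetical names, and the 20-word threshold simply mirrors the guideline above.

```typescript
// Hypothetical pre-flight check for a narration script (not a product API).
interface ScriptWarning {
  kind: "long-sentence" | "all-caps";
  detail: string;
}

function lintTtsScript(text: string, maxWords = 20): ScriptWarning[] {
  const warnings: ScriptWarning[] = [];

  // Naive sentence split: a run of text followed by its closing punctuation.
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  // Guideline: keep sentences under maxWords words.
  for (const sentence of sentences) {
    const wordCount = sentence.split(/\s+/).length;
    if (wordCount >= maxWords) {
      warnings.push({ kind: "long-sentence", detail: sentence });
    }
  }

  // Guideline: avoid all-caps. This also flags acronyms of 3+ letters,
  // so treat the result as a prompt for review, not a hard error.
  const capsWords = text.match(/\b[A-Z]{3,}\b/g) ?? [];
  for (const word of capsWords) {
    warnings.push({ kind: "all-caps", detail: word });
  }

  return warnings;
}
```

A script that returns an empty array passes both checks. Pronunciation problems still need a short test generation to catch.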
Step 4: Pick voice_id
Each provider has its own voice library:
ElevenLabs: dozens of stock voices + cloned voices (voice_id is their identifier)
OpenAI TTS: handful of named voices (alloy, echo, fable, onyx, nova, shimmer)
Per-locale: pick a voice native to the target language for natural accent.
Step 5: Set language
Language code (en, it, es, fr, de, etc.). Affects pronunciation rules + intonation.
Step 6: Optional: provider-specific settings
For ElevenLabs:
Stability (lower = more variation; higher = more consistent)
Similarity boost (how closely to match the voice clone)
Style exaggeration
For OpenAI TTS:
Speed (0.5× to 2×)
Defaults work for most cases.
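As an illustration of how the provider-specific settings might be modeled, here is a hedged sketch. The key names inside `settings` (`stability`, `similarity_boost`, `style`, `speed`) are assumptions, since this page lists the settings only by their display names. The only range enforced is the 0.5 to 2 speed range stated above.

```typescript
// Assumed key names; confirm the actual keys against the API before relying on them.
type ElevenLabsSettings = {
  stability?: number;        // lower = more variation, higher = more consistent
  similarity_boost?: number; // how closely to match the voice clone
  style?: number;            // style exaggeration
};

type OpenAiTtsSettings = {
  speed?: number; // 0.5 to 2
};

type ProviderSettings =
  | { provider: "elevenlabs"; settings: ElevenLabsSettings }
  | { provider: "openai_tts"; settings: OpenAiTtsSettings };

// Returns a list of human-readable problems; an empty list means no issue found.
function validateSettings(input: ProviderSettings): string[] {
  const errors: string[] = [];
  if (input.provider === "openai_tts") {
    const speed = input.settings.speed;
    if (speed !== undefined && (speed < 0.5 || speed > 2)) {
      errors.push(`speed must be between 0.5 and 2, got ${speed}`);
    }
  }
  // No ranges are documented here for the ElevenLabs settings, so none are enforced.
  return errors;
}
```

The discriminated union keeps ElevenLabs-only settings from being sent with openai_tts at compile time, which is one way to avoid silently ignored options.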
Step 7: Submit
Click Generate. The request returns 202 Accepted with a job_id.
Step 8: Track + download
TTS is fast: 5-30 sec typical.
Once completed: audio file (MP3) in your Drive folder. Use directly OR pair with video via compositing.
Endpoint
POST /api/v1/creative-hub/generate/tts (verified).
Body:
text (required)
provider (one of 2)
voice_id (required)
language (e.g. en, it)
settings (JSON, provider-specific)
folder_id (optional)
Returns 202 + job_id. Worker calls upstream provider, downloads audio, stores in Drive, marks completed.
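A request against this endpoint could be assembled roughly as follows. This is a sketch under stated assumptions: the authentication headers, the JSON response shape (`{ job_id }`), and the helper names `buildTtsRequest`, `submitTts`, and `FetchLike` are all hypothetical. Only the path, the body fields, and the 202 status come from this page.

```typescript
type TtsProvider = "elevenlabs" | "openai_tts";

interface TtsRequestBody {
  text: string;                       // required
  provider: TtsProvider;              // one of the 2 providers
  voice_id: string;                   // required
  language?: string;                  // e.g. "en", "it"
  settings?: Record<string, unknown>; // provider-specific JSON
  folder_id?: string;                 // optional
}

interface HttpRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Minimal fetch-compatible signature, so the sketch works with any HTTP client.
type FetchLike = (
  url: string,
  init: HttpRequest
) => Promise<{ status: number; json(): Promise<unknown> }>;

function buildTtsRequest(
  baseUrl: string,
  body: TtsRequestBody,
  authHeaders: Record<string, string> = {} // auth scheme is not described here
): { url: string; init: HttpRequest } {
  if (body.text.trim() === "") throw new Error("text is required");
  if (body.voice_id.trim() === "") throw new Error("voice_id is required");
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/api/v1/creative-hub/generate/tts`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", ...authHeaders },
      body: JSON.stringify(body),
    },
  };
}

async function submitTts(
  fetchFn: FetchLike,
  baseUrl: string,
  body: TtsRequestBody,
  authHeaders: Record<string, string> = {}
): Promise<string> {
  const { url, init } = buildTtsRequest(baseUrl, body, authHeaders);
  const res = await fetchFn(url, init);
  if (res.status !== 202) {
    throw new Error(`Expected 202 Accepted, got ${res.status}`);
  }
  // Assumption: the response is JSON containing a string job_id.
  const data = await res.json();
  const jobId =
    typeof data === "object" && data !== null
      ? (data as { job_id?: unknown }).job_id
      : undefined;
  if (typeof jobId !== "string") throw new Error("Response did not include a job_id");
  return jobId;
}
```

In a runtime with a global `fetch`, passing it as `fetchFn` should work. Separating the builder from the sender also makes the request easy to inspect without any network call.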
Cost
TTS cost is generally low (charged per character or per token). ElevenLabs costs more than OpenAI TTS.
For long-form content (narration of 1 minute or more), cost adds up. Consider OpenAI TTS for iteration and ElevenLabs for the final version.
See ch-112 AI credits.
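The iterate-cheap, finalize-premium pattern can be reasoned about with simple arithmetic. The sketch below assumes per-character billing and takes both rates as inputs, because actual rates are not stated on this page. `estimateCost` and `CostPlan` are hypothetical names.

```typescript
interface CostPlan {
  draftRuns: number;        // how many iterations on the cheaper provider
  draftRatePerChar: number; // caller-supplied rate, e.g. from the AI credits page
  finalRatePerChar: number; // caller-supplied rate for the premium provider
}

interface CostEstimate {
  characters: number;
  draftCost: number;
  finalCost: number;
  totalCost: number;
}

// Cost = characters x rate, summed over all draft runs plus one final run.
function estimateCost(script: string, plan: CostPlan): CostEstimate {
  const characters = script.length;
  const draftCost = characters * plan.draftRatePerChar * plan.draftRuns;
  const finalCost = characters * plan.finalRatePerChar;
  return { characters, draftCost, finalCost, totalCost: draftCost + finalCost };
}
```

Comparing `totalCost` against `characters * finalRatePerChar * (draftRuns + 1)` shows what the same number of iterations would cost on the premium provider alone.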
Multilingual workflow
Same script translated → same voice_id (or per-locale matched voice) → N audio variants for N languages.
Common pattern:
Write English script
Translate via external tool (or human)
Generate TTS in each language with appropriate voice_id + language
Pair each audio with video (often the same video, multiple audio tracks via compositing)
Result: localized ads from a single video asset.
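The pattern above amounts to producing one request body per locale. A minimal sketch follows, assuming the body fields listed under Endpoint. `buildLocalizedBodies` and `LocalizedScript` are hypothetical names, and translation is expected to have happened beforehand.

```typescript
type Provider = "elevenlabs" | "openai_tts";

interface LocalizedScript {
  language: string; // language code, e.g. "en", "it"
  text: string;     // already-translated script
  voice_id: string; // same voice for all, or a locale-matched voice
}

interface LocalizedTtsBody {
  text: string;
  provider: Provider;
  voice_id: string;
  language: string;
  folder_id?: string;
}

// One request body per language; rejects duplicate language codes early.
function buildLocalizedBodies(
  provider: Provider,
  scripts: LocalizedScript[],
  folderId?: string
): LocalizedTtsBody[] {
  const seen = new Set<string>();
  return scripts.map((script) => {
    if (seen.has(script.language)) {
      throw new Error(`Duplicate language: ${script.language}`);
    }
    seen.add(script.language);
    const body: LocalizedTtsBody = {
      text: script.text,
      provider,
      voice_id: script.voice_id,
      language: script.language,
    };
    if (folderId !== undefined) body.folder_id = folderId;
    return body;
  });
}
```

Each returned body can then be submitted as its own generation job, giving N audio variants for N languages.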
Voice cloning (ElevenLabs)
For a custom voice (your spokesperson, brand voice):
Upload voice samples to ElevenLabs UI (their consent + capture flow)
ElevenLabs creates a cloned voice
Reference the new voice_id in Wevion
Allows scaling spokesperson voice across hundreds of ad variants without recording each one.
Best practices
Test short before long
Generate 1-2 sentences to validate voice + pronunciation. Then generate full script.
Use punctuation deliberately
Commas = short pause. Periods = full pause. Em-dashes — like this — = thoughtful pause. TTS respects these.
Match voice to ad mood
Energetic ads: faster voice with enthusiasm
Trust / authority ads: deeper voice with measured pace
Casual / friendly: lighter voice, conversational
Pair with compositing immediately
Audio alone has limited use. Pair with video via ch-117 compositing for ad-ready output.
Common mistakes
Generating a long monologue: cut it into shorter sections for better pacing + easier iteration
Wrong language for voice: a voice trained on English that speaks Italian produces a strange accent. Use a locale-matched voice.
Treating TTS like talking-head video: TTS is audio only; you still need visuals (video or static image carousel)
Skipping iteration: the first TTS generation is rarely perfect; iterate on punctuation + pacing
Common issues
Mispronunciation of brand name: use phonetic spelling in the input
Tone too monotone: try ElevenLabs with higher style exaggeration; or OpenAI TTS with a more expressive voice
Audio cuts off mid-sentence: the text length may have hit the provider's limit; split the script into multiple shorter generations
Languages mixed mid-script: providers handle code-switching poorly; keep single-language scripts
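For the cut-off issue above, splitting a long script at sentence boundaries can be sketched as follows. The character limit is a caller-supplied parameter because provider limits are not stated on this page, and `splitScript` is a hypothetical helper.

```typescript
// Greedily packs whole sentences into chunks of at most maxChars characters.
// A single sentence longer than maxChars is kept intact as its own chunk.
// Limitation: the naive split also breaks on decimals such as "3.5".
function splitScript(text: string, maxChars: number): string[] {
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    const candidate = current === "" ? sentence : `${current} ${sentence}`;
    if (current === "" || candidate.length <= maxChars) {
      current = candidate; // still fits, or this is the first sentence of a chunk
    } else {
      chunks.push(current); // close the chunk and start a new one
      current = sentence;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Each chunk can be submitted as a separate generation and the resulting audio files joined during compositing. Keeping sentences whole avoids cuts in the middle of a phrase.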
Related
Video compositing — combine TTS + video
Create avatars — talking-head alternative
AI best practices — broader creative guidance