Text-to-speech (TTS) voice generation

Last updated: May 19, 2026

Endpoint: POST /api/v1/creative-hub/generate/tts (verified in apps/backend/src/routes/api/creative-hub-generate.route.ts). Two providers are verified in apps/backend/src/providers/creative/types.ts: elevenlabs (high quality, multilingual, voice cloning) and openai_tts (a good baseline that is faster and cheaper). Parameters: text, voice_id, language, plus provider-specific settings. Generation runs asynchronously via generate-tts.worker.ts. Output: an audio file (typically MP3) in your Drive folder.

Who is this for

Media buyers who need voice-over for video ads, multilingual narration, or accessibility audio. It is especially useful as input to compositing (ch-117).

The 2 providers

elevenlabs

Best for: premium voice quality, multilingual, voice cloning.

Strengths:

Highest fidelity (sounds most human)
Excellent multilingual support
Voice cloning available (record your spokesperson's voice → clone → generate any text in their voice)
Emotional tone control

Weaknesses:

More expensive
Slower per character

openai_tts

Best for: fast baseline TTS, cost-conscious workflows, English-first content.

Strengths:

Fast generation
Lower cost
Stable quality for English

Weaknesses:

Less natural than ElevenLabs in side-by-side
Limited multilingual range

How to generate

Step 1: Open the generator

/creative-hub → AI Generate → TTS tab.

Step 2: Pick a provider

Choose a provider from the dropdown. The default may be openai_tts or whatever your workspace has configured.

Step 3: Write the text

Enter the narration script in the text field.

Best practices for TTS-friendly text:

Short sentences (under 20 words)
Natural punctuation (commas + periods drive pauses)
Avoid all-caps (some providers shout it)
Phonetic spelling for tricky names ("acme-AY-mee" if pronunciation differs from spelling)
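
The first three guidelines above can be checked automatically before you submit a script. The sketch below is a hypothetical helper, not part of Wevion: the function name, the warning shape, and the choice to treat words of four or more capital letters as "all-caps" (so short acronyms pass) are all assumptions made for illustration.

```typescript
// Hypothetical pre-submit check for TTS scripts (not a Wevion API).
interface ScriptWarning {
  sentence: string;
  reason: "too_long" | "all_caps";
}

function lintTtsScript(script: string, maxWords = 20): ScriptWarning[] {
  const warnings: ScriptWarning[] = [];

  // Naive sentence split: runs of text ending in ., ! or ?.
  // Decimals such as "3.5" will be split too; good enough for a quick check.
  const sentences = (script.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  for (const sentence of sentences) {
    const wordCount = sentence.split(/\s+/).length;

    // The guideline is "under 20 words", so 20 or more is flagged.
    if (wordCount >= maxWords) {
      warnings.push({ sentence, reason: "too_long" });
    }

    // Assumption: 4+ consecutive capitals counts as shouting;
    // shorter acronyms like "TTS" are let through.
    if (/\b[A-Z]{4,}\b/.test(sentence)) {
      warnings.push({ sentence, reason: "all_caps" });
    }
  }
  return warnings;
}
```

Calling lintTtsScript on a draft returns one entry per problem, so an empty array means the script passes these two checks. Pronunciation of tricky names still needs a listening test.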

Step 4: Pick voice_id

Each provider has its own voice library:

ElevenLabs: dozens of stock voices + cloned voices (voice_id is their identifier)
OpenAI TTS: handful of named voices (alloy, echo, fable, onyx, nova, shimmer)

Per-locale: pick a voice native to the target language for natural accent.

Step 5: Set language

Language code (en, it, es, fr, de, etc.). Affects pronunciation rules + intonation.

Step 6: Optional: provider-specific settings

For ElevenLabs:

Stability (lower = more variation; higher = more consistent)
Similarity boost (how closely to match the voice clone)
Style exaggeration

For OpenAI TTS:

Speed (0.5× to 2×)

Defaults work for most cases.
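
If you do override the defaults, it can help to keep values inside a known range before sending them. The sketch below is illustrative only: the field names (stability, similarity_boost, style, speed) and the 0 to 1 range for the ElevenLabs values are assumptions, since this guide does not document the exact shape of the settings JSON. Only the 0.5 to 2 speed range comes from the text above.

```typescript
// Assumed field names; the real settings schema is not documented in this guide.
interface ElevenLabsSettings {
  stability?: number;
  similarity_boost?: number;
  style?: number;
}

interface OpenAiTtsSettings {
  speed?: number;
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Speed range 0.5 to 2 is taken from the settings list above.
function normalizeOpenAiSettings(s: OpenAiTtsSettings): OpenAiTtsSettings {
  return s.speed === undefined ? {} : { speed: clamp(s.speed, 0.5, 2) };
}

// Assumption: each ElevenLabs value lives in the range 0 to 1.
function normalizeElevenLabsSettings(s: ElevenLabsSettings): ElevenLabsSettings {
  const out: ElevenLabsSettings = {};
  if (s.stability !== undefined) out.stability = clamp(s.stability, 0, 1);
  if (s.similarity_boost !== undefined) out.similarity_boost = clamp(s.similarity_boost, 0, 1);
  if (s.style !== undefined) out.style = clamp(s.style, 0, 1);
  return out;
}
```

Keys you leave out stay out, so the provider default still applies to them.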

Step 7: Submit

Click Generate. The request returns 202 Accepted with a job_id.

Step 8: Track + download

TTS is fast: 5-30 sec typical.

Once completed: audio file (MP3) in your Drive folder. Use directly OR pair with video via compositing.
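
If you are scripting against the API rather than using the UI, tracking usually means polling until the job leaves its pending state. This guide does not document a job-status endpoint, so the sketch below takes the status lookup as a function you supply. The state names "pending" and "failed" are assumptions; only "completed" appears in this guide.

```typescript
// Only "completed" is mentioned in this guide; the other states are assumed.
type TtsJobState = "pending" | "completed" | "failed";

interface TtsJobStatus {
  state: TtsJobState;
}

// getStatus is supplied by the caller because the status endpoint
// is not documented here.
async function waitForTtsJob(
  jobId: string,
  getStatus: (jobId: string) => Promise<TtsJobStatus>,
  intervalMs = 2000,
  timeoutMs = 60000
): Promise<TtsJobStatus> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const status = await getStatus(jobId);
    if (status.state !== "pending") {
      return status; // completed or failed: stop polling
    }
    if (Date.now() >= deadline) {
      throw new Error(`Job ${jobId} still pending after ${timeoutMs} ms`);
    }
    // Wait before the next check to avoid hammering the API.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The 60 second default timeout is a guess that leaves headroom over the 5-30 second typical duration; adjust it for long scripts.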

Endpoint

POST /api/v1/creative-hub/generate/tts (verified).

Body:

text (required)
provider (elevenlabs or openai_tts)
voice_id (required)
language (e.g. en, it)
settings (JSON, provider-specific)
folder_id (optional)

Returns 202 + job_id. Worker calls upstream provider, downloads audio, stores in Drive, marks completed.
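
As a rough illustration of the request, the sketch below builds the body from the fields listed above and posts it. Several details are assumptions rather than documented behavior: the Bearer token header, the JSON response shape { job_id }, and the helper names. The HTTP call is injected so you can pass the global fetch (browser or Node 18+) or a stub.

```typescript
type TtsProvider = "elevenlabs" | "openai_tts";

interface TtsRequestBody {
  text: string;
  voice_id: string;
  provider?: TtsProvider;
  language?: string;
  settings?: Record<string, unknown>;
  folder_id?: string;
}

// Minimal shape of an HTTP POST function; the global fetch satisfies it.
type HttpPost = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ status: number; json(): Promise<unknown> }>;

// Rejects bodies that are missing the two required fields.
function buildTtsRequest(input: TtsRequestBody): TtsRequestBody {
  if (input.text.trim() === "") throw new Error("text is required");
  if (input.voice_id.trim() === "") throw new Error("voice_id is required");
  return { ...input };
}

async function submitTts(
  baseUrl: string,
  token: string, // assumption: the API uses Bearer token auth
  body: TtsRequestBody,
  post: HttpPost
): Promise<string> {
  const res = await post(`${baseUrl}/api/v1/creative-hub/generate/tts`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(body),
  });
  if (res.status !== 202) {
    throw new Error(`Expected 202 Accepted, got ${res.status}`);
  }
  // Assumption: the job id comes back as a "job_id" string field.
  const payload = (await res.json()) as { job_id?: unknown };
  if (typeof payload.job_id !== "string") {
    throw new Error("Response did not include a job_id");
  }
  return payload.job_id;
}
```

A typical call would be submitTts(baseUrl, token, buildTtsRequest({ text, voice_id: "alloy", provider: "openai_tts", language: "en" }), fetch), with the returned job_id used for tracking.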

Cost

TTS cost is generally low (charged per character or per token). ElevenLabs costs more than OpenAI TTS.

For long-form content (narration of 1 minute or more), cost adds up: consider OpenAI TTS for iteration and ElevenLabs for the final version.

See ch-112 AI credits.

Multilingual workflow

Same script translated → same voice_id (or per-locale matched voice) → N audio variants for N languages.

Common pattern:

Write English script
Translate via external tool (or human)
Generate TTS in each language with appropriate voice_id + language
Pair each audio with video (often the same video, multiple audio tracks via compositing)

Result: localized ads from a single video asset.
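
The pattern above can be sketched as a small function that turns a set of translations into one request per language. The helper and the voice identifiers in the usage note are hypothetical; the per-locale voice map with a shared fallback mirrors the "same voice_id (or per-locale matched voice)" choice described above.

```typescript
interface LocaleTtsRequest {
  language: string;
  text: string;
  voice_id: string;
}

// translations: language code -> translated script
// voices: language code -> locale-matched voice_id (optional per language)
// fallbackVoice: shared voice_id used when a language has no matched voice
function buildLocaleRequests(
  translations: Record<string, string>,
  voices: Record<string, string>,
  fallbackVoice?: string
): LocaleTtsRequest[] {
  const requests: LocaleTtsRequest[] = [];
  for (const language of Object.keys(translations)) {
    const text = translations[language];
    if (text === undefined) continue;

    const voiceId = voices[language] ?? fallbackVoice;
    if (voiceId === undefined) {
      throw new Error(`No voice_id available for language "${language}"`);
    }
    requests.push({ language, text, voice_id: voiceId });
  }
  return requests;
}
```

For example, buildLocaleRequests({ en: "Hello", it: "Ciao" }, { it: "voice_it" }, "voice_default") yields two requests: English with the shared voice and Italian with its matched voice. Each entry can then be submitted as its own TTS job.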

Voice cloning (ElevenLabs)

For a custom voice (your spokesperson, brand voice):

Upload voice samples to ElevenLabs UI (their consent + capture flow)
ElevenLabs creates a cloned voice
Reference the new voice_id in Wevion

This lets you scale a spokesperson's voice across hundreds of ad variants without recording each one.

Best practices

Test short before long

Generate 1-2 sentences to validate voice + pronunciation. Then generate full script.

Use punctuation deliberately

Commas = short pause. Periods = full pause. Em-dashes — like this — = thoughtful pause. TTS respects these.

Match voice to ad mood

Energetic ads: faster voice with enthusiasm
Trust / authority ads: deeper voice with measured pace
Casual / friendly: lighter voice, conversational

Pair with compositing immediately

Audio alone has limited use. Pair with video via ch-117 compositing for ad-ready output.

Common mistakes

Generating a long monologue: cut it into shorter sections for better pacing and easier iteration
Wrong language for voice: a voice trained on English that speaks Italian produces a strange accent. Use a locale-matched voice.
Treating TTS like talking-head video: TTS is audio only; you still need visuals (video or a static image carousel)
Skipping iteration: the first TTS generation is rarely perfect; iterate on punctuation and pacing

Common issues

Mispronunciation of brand name: use phonetic spelling in the input
Tone too monotone: try ElevenLabs with higher style exaggeration; or OpenAI TTS with a more expressive voice
Audio cuts off mid-sentence: text length may have hit provider limit; split into multiple shorter generations
Languages mixed mid-script: providers handle code-switching poorly; keep single-language scripts
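
For the cut-off issue, splitting can be done at sentence boundaries so each generation ends on a natural pause. The sketch below is a hypothetical helper; the maxChars value is yours to choose, since this guide does not state the providers' actual length limits.

```typescript
// Greedily packs whole sentences into chunks of at most maxChars characters.
// A single sentence longer than maxChars is kept whole rather than cut mid-sentence.
function splitScript(script: string, maxChars: number): string[] {
  // Naive sentence split on ., ! and ? (decimals like "3.5" will also split).
  const sentences = (script.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current === "" ? sentence : `${current} ${sentence}`;
    if (candidate.length <= maxChars) {
      current = candidate; // still fits: keep growing this chunk
    } else {
      if (current !== "") chunks.push(current);
      current = sentence; // start a new chunk with this sentence
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

With a limit of 20 characters, splitScript("One two. Three four. Five six.", 20) returns two chunks: "One two. Three four." and "Five six.". Generate each chunk as its own job and join the audio afterwards.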

Related

Video compositing: combine TTS + video
Create avatars: talking-head alternative
AI best practices: broader creative guidance