Text-to-speech (TTS) voice generation

Last updated: May 19, 2026

Endpoint: POST /api/v1/creative-hub/generate/tts (verified in apps/backend/src/routes/api/creative-hub-generate.route.ts). Two providers are verified in apps/backend/src/providers/creative/types.ts: elevenlabs (high quality, multilingual, voice cloning) and openai_tts (a good baseline that is faster and cheaper). Parameters: text, voice_id, language, plus provider-specific settings. Generation runs asynchronously via generate-tts.worker.ts. Output: an audio file (typically MP3) in your Drive folder.

Who is this for

Mediabuyers who need voice-over for video ads, multilingual narration, or accessibility audio. It is especially useful as input to compositing (ch-117).

The 2 providers

elevenlabs

Best for: premium voice quality, multilingual, voice cloning.

Strengths:

  • Highest fidelity (sounds most human)

  • Excellent multilingual support

  • Voice cloning available (record your spokesperson's voice → clone → generate any text in their voice)

  • Emotional tone control

Weaknesses:

  • More expensive

  • Slower per character

openai_tts

Best for: fast baseline TTS, cost-conscious workflows, English-first content.

Strengths:

  • Fast generation

  • Lower cost

  • Stable quality for English

Weaknesses:

  • Less natural than ElevenLabs in side-by-side

  • Limited multilingual range

How to generate

Step 1: Open the generator

/creative-hub → AI Generate → TTS tab.

Step 2: Pick a provider

Choose a provider from the dropdown. The default may be openai_tts or a workspace-configured provider.

Step 3: Write the text

Enter the narration script in the text field.

Best practices for TTS-friendly text:

  • Short sentences (under 20 words)

  • Natural punctuation (commas + periods drive pauses)

  • Avoid all-caps (some providers shout it)

  • Phonetic spelling for tricky names ("acme-AY-mee" if pronunciation differs from spelling)
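The first and third guidelines above can be checked automatically before you submit a script. The following is a minimal sketch, not part of the product: the function name lint_tts_script and the 4-letter threshold for all-caps words are assumptions made for illustration.

```python
import re

MAX_WORDS = 20  # from the "under 20 words" guideline above


def lint_tts_script(text: str) -> list:
    """Return human-readable warnings for a narration script (hypothetical helper)."""
    warnings = []
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for sentence in sentences:
        words = sentence.split()
        if len(words) >= MAX_WORDS:
            warnings.append(f"Long sentence ({len(words)} words): {sentence[:40]}")
        # Flag all-caps words of 4+ letters; shorter ones are often acronyms.
        # The 4-letter threshold is an assumption, not a provider rule.
        shouted = []
        for word in words:
            bare = re.sub(r"[^\w]", "", word)
            if len(bare) >= 4 and bare.isupper():
                shouted.append(bare)
        if shouted:
            warnings.append("All-caps word(s): " + ", ".join(shouted))
    return warnings


if __name__ == "__main__":
    for warning in lint_tts_script("Buy now. HUGE savings today."):
        print(warning)
```

Running the check on a short test script before a full generation fits the "test short before long" practice described later in this article.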

Step 4: Pick voice_id

Each provider has its own voice library:

  • ElevenLabs: dozens of stock voices + cloned voices (voice_id is their identifier)

  • OpenAI TTS: handful of named voices (alloy, echo, fable, onyx, nova, shimmer)

Per-locale: pick a voice native to the target language for natural accent.

Step 5: Set language

Language code (en, it, es, fr, de, etc.). Affects pronunciation rules + intonation.

Step 6: Optional: provider-specific settings

For ElevenLabs:

  • Stability (lower = more variation; higher = more consistent)

  • Similarity boost (how closely to match the voice clone)

  • Style exaggeration

For OpenAI TTS:

  • Speed (0.5× to 2×)

Defaults work for most cases.
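If you set these values through the API rather than the UI, a small validation step can catch mistakes before a job is queued. The sketch below is illustrative only: the key names (stability, similarity_boost, style, speed) are assumptions and may not match the field names the settings object actually expects. Only the 0.5 to 2 speed range comes from this article.

```python
# Assumed key names per provider; verify against the real API before relying on them.
ALLOWED_SETTINGS = {
    "elevenlabs": {"stability", "similarity_boost", "style"},
    "openai_tts": {"speed"},
}


def build_settings(provider: str, **options: float) -> dict:
    """Validate provider-specific options and return them as a settings dict."""
    if provider not in ALLOWED_SETTINGS:
        raise ValueError(f"Unknown provider: {provider}")
    unknown = set(options) - ALLOWED_SETTINGS[provider]
    if unknown:
        raise ValueError(f"Settings not supported by {provider}: {sorted(unknown)}")
    # Speed range (0.5x to 2x) is the one documented above for OpenAI TTS.
    if "speed" in options and not 0.5 <= options["speed"] <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    return dict(options)


if __name__ == "__main__":
    print(build_settings("openai_tts", speed=1.25))
    print(build_settings("elevenlabs", stability=0.4, style=0.6))
```

Rejecting unknown keys per provider is a design choice that surfaces typos early, such as passing speed to elevenlabs.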

Step 7: Submit

Click Generate. The request returns 202 Accepted with a job_id.

Step 8: Track + download

TTS is fast: 5-30 sec typical.

Once completed: audio file (MP3) in your Drive folder. Use directly OR pair with video via compositing.
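Because generation is asynchronous, an API client has to wait for the job to finish. This article does not document a job-status endpoint, so the sketch below takes the status lookup as an injected function (get_status, a hypothetical callable you would implement yourself). The "completed" status comes from this article; the "failed" status and the dictionary shape are assumptions.

```python
import time
from typing import Callable


def wait_for_job(
    job_id: str,
    get_status: Callable[[str], dict],
    interval_s: float = 2.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get_status(job_id) until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = get_status(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":  # assumed failure status name
            raise RuntimeError(f"TTS job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"TTS job {job_id} did not finish in {timeout_s}s")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated status sequence standing in for a real status lookup.
    fake = iter([{"status": "processing"}, {"status": "completed"}])
    print(wait_for_job("job_1", lambda _id: next(fake), sleep=lambda s: None))
```

With a typical turnaround of 5 to 30 seconds, a 2-second interval and a 2-minute timeout leave comfortable headroom; both values are illustrative defaults.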

Endpoint

POST /api/v1/creative-hub/generate/tts (verified).

Body:

  • text (required)

  • provider (elevenlabs or openai_tts)

  • voice_id (required)

  • language (e.g. en, it)

  • settings (JSON, provider-specific)

  • folder_id (optional)

Returns 202 + job_id. Worker calls upstream provider, downloads audio, stores in Drive, marks completed.
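The body fields above map onto a JSON request. The following sketch builds the request with the Python standard library without sending it. The base URL, the Bearer token header, and sending settings as a nested JSON object are all assumptions; this article does not specify the host or authentication scheme.

```python
import json
import urllib.request
from typing import Optional

API_PATH = "/api/v1/creative-hub/generate/tts"  # endpoint path from this article


def build_tts_request(
    base_url: str,
    token: str,
    *,
    text: str,
    voice_id: str,
    provider: str = "openai_tts",
    language: str = "en",
    settings: Optional[dict] = None,
    folder_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a TTS generation job."""
    if not text or not voice_id:
        raise ValueError("text and voice_id are required")
    body = {"text": text, "provider": provider, "voice_id": voice_id, "language": language}
    if settings:
        body["settings"] = settings  # assumed to be a nested JSON object
    if folder_id:
        body["folder_id"] = folder_id
    return urllib.request.Request(
        base_url.rstrip("/") + API_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # auth scheme is an assumption
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_tts_request(
        "https://wevion.example",  # placeholder host
        "YOUR_TOKEN",
        text="Meet the new Acme blender.",
        voice_id="alloy",
    )
    print(req.get_method(), req.full_url)
    # To send: urllib.request.urlopen(req), then read job_id from the 202 response.
```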

Cost

TTS cost is generally low (charged per character or per token). ElevenLabs costs more than OpenAI TTS.

For long-form content (narration of 1 minute or more), cost adds up. Consider OpenAI TTS for iteration and ElevenLabs for the final version.

See ch-112 AI credits.

Multilingual workflow

Same script translated → same voice_id (or per-locale matched voice) → N audio variants for N languages.

Common pattern:

  1. Write English script

  2. Translate via external tool (or human)

  3. Generate TTS in each language with appropriate voice_id + language

  4. Pair each audio with video (often the same video, multiple audio tracks via compositing)

Result: localized ads from a single video asset.
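Step 3 of the pattern above amounts to producing one request body per language. A minimal sketch, assuming you already have translated scripts and have chosen a voice per locale: the function name and the voice IDs in the usage example are hypothetical placeholders, not real identifiers.

```python
def build_localized_jobs(scripts: dict, voices: dict, provider: str = "elevenlabs") -> list:
    """Return one TTS request body per language.

    scripts maps language code -> translated script.
    voices maps language code -> locale-matched voice_id.
    """
    missing = sorted(set(scripts) - set(voices))
    if missing:
        raise ValueError(f"No voice_id configured for: {missing}")
    return [
        {"provider": provider, "text": text, "voice_id": voices[lang], "language": lang}
        for lang, text in scripts.items()
    ]


if __name__ == "__main__":
    jobs = build_localized_jobs(
        scripts={"en": "Meet the new blender.", "it": "Ecco il nuovo frullatore."},
        voices={"en": "voice_en_placeholder", "it": "voice_it_placeholder"},
    )
    for job in jobs:
        print(job["language"], job["voice_id"])
```

Failing fast when a language has no configured voice avoids silently falling back to a voice that does not match the locale.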

Voice cloning (ElevenLabs)

For a custom voice (your spokesperson, brand voice):

  1. Upload voice samples to ElevenLabs UI (their consent + capture flow)

  2. ElevenLabs creates a cloned voice

  3. Reference the new voice_id in Wevion

This lets you scale a spokesperson's voice across hundreds of ad variants without recording each one.

Best practices

Test short before long

Generate 1-2 sentences to validate voice + pronunciation. Then generate full script.

Use punctuation deliberately

Commas = short pause. Periods = full pause. Em-dashes — like this — = thoughtful pause. TTS respects these.

Match voice to ad mood

  • Energetic ads: faster voice with enthusiasm

  • Trust / authority ads: deeper voice with measured pace

  • Casual / friendly: lighter voice, conversational

Pair with compositing immediately

Audio alone has limited use. Pair with video via ch-117 compositing for ad-ready output.

Common mistakes

  • Generating one long monologue: cut it into shorter sections for better pacing and easier iteration

  • Wrong language for voice: a voice trained on English will speak Italian with a strange accent. Use a locale-matched voice.

  • Treating TTS like talking-head video: TTS is audio only; you still need visuals (video or static image carousel)

  • Skipping iteration: the first TTS generation is rarely perfect; iterate on punctuation and pacing

Common issues

  • Mispronunciation of brand name: use phonetic spelling in the input

  • Tone too monotone: try ElevenLabs with higher style exaggeration; or OpenAI TTS with a more expressive voice

  • Audio cuts off mid-sentence: the text may have exceeded the provider's length limit; split it into multiple shorter generations

  • Languages mixed mid-script: providers handle code-switching poorly; keep single-language scripts
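For the cut-off issue above, splitting a long script at sentence boundaries keeps each generation short without breaking a sentence in half. This is a sketch under stated assumptions: the 1000-character default is a hypothetical value, not a documented provider limit, and split_script is an illustrative helper name.

```python
import re


def split_script(text: str, max_chars: int = 1000) -> list:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own oversized
    chunk rather than being cut mid-sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for i, chunk in enumerate(split_script("One. Two. Three.", max_chars=9), start=1):
        print(i, chunk)
```

Each chunk can then be submitted as its own generation and the resulting audio files joined during compositing.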

Related