Text-to-speech (TTS) voice generation

POST /api/v1/creative-hub/generate/tts accepts elevenlabs / openai_tts. Body: text + voice_id + language + provider + drive_folder_id (no settings field). Audio output for compositing.

Written By Salvatore Sinigaglia

Last updated About 4 hours ago

POST /api/v1/creative-hub/generate/tts (verified in apps/backend/src/routes/api/creative-hub-generate.route.ts). The route accepts two providers: elevenlabs (high quality, multilingual, voice cloning) and openai_tts (good baseline, faster and cheaper). The provider catalog in apps/backend/src/providers/creative/types.ts also lists heygen as a TTS provider, which is used from the studio suite. Body: text, voice_id, language, provider, drive_folder_id. There is no settings field. The request is processed asynchronously by the TTS worker. Output: an audio file in your Drive folder.

Who is this for

Media buyers who need voice-over for video ads, multilingual narration, or accessibility audio. The output is most often used as input to compositing (ch-117).

The 2 providers

elevenlabs

Best for: premium voice quality, multilingual, voice cloning.

Strengths:

  • Highest fidelity (sounds most human)
  • Excellent multilingual support
  • Voice cloning available (record your spokesperson's voice → clone → generate any text in their voice)
  • Emotional tone control

Weaknesses:

  • More expensive
  • Slower per character

openai_tts

Best for: fast baseline TTS, cost-conscious workflows, English-first content.

Strengths:

  • Fast generation
  • Lower cost
  • Stable quality for English

Weaknesses:

  • Less natural than ElevenLabs in side-by-side
  • Limited multilingual range

How to generate

Step 1: Open the generator

/creative-hub → AI Generate → TTS tab.

Step 2: Pick a provider

Choose a provider from the dropdown. The default may be openai_tts or a workspace-configured provider.

Step 3: Write the text

Text field. The narration script.

Best practices for TTS-friendly text:

  • Short sentences (under 20 words)
  • Natural punctuation (commas + periods drive pauses)
  • Avoid all-caps (some providers shout it)
  • Phonetic spelling for tricky names ("acme-AY-mee" if pronunciation differs from spelling)
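The guidelines above can be checked mechanically before you submit a script. The following is a minimal sketch, not part of Wevion: the function name lint_tts_script and the rule that only all-caps words of four or more letters are flagged (so short acronyms pass) are assumptions made for illustration.

```python
import re

MAX_WORDS_PER_SENTENCE = 20  # threshold taken from the guideline above


def lint_tts_script(text: str) -> list[str]:
    """Return human-readable warnings for a narration script."""
    warnings = []
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for index, sentence in enumerate(sentences, start=1):
        words = sentence.split()
        if len(words) > MAX_WORDS_PER_SENTENCE:
            warnings.append(
                f"sentence {index}: {len(words)} words (over {MAX_WORDS_PER_SENTENCE})"
            )
        # Assumption: flag upper-case words of 4+ letters; short acronyms pass.
        shouted = [w for w in words if re.fullmatch(r"[A-Z]{4,}[.,!?]?", w)]
        if shouted:
            warnings.append(f"sentence {index}: all-caps words {shouted}")
    return warnings
```

Run it on a draft and fix whatever it reports before spending credits on a generation. It does not check pronunciation, so tricky names still need a listening test.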

Step 4: Pick voice_id

Each provider has its own voice library:

  • ElevenLabs: dozens of stock voices + cloned voices (voice_id is their identifier)
  • OpenAI TTS: handful of named voices (alloy, echo, fable, onyx, nova, shimmer)

Per-locale: pick a voice native to the target language for natural accent.
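Because each provider has its own voice library, a quick sanity check can catch a voice_id sent to the wrong provider. This is a sketch under assumptions: the OpenAI voice names are the ones listed above and may change over time, check_voice is a hypothetical helper, and ElevenLabs IDs are not checked because they are specific to your account.

```python
# Voice names as listed in this article; the provider may add or retire voices.
OPENAI_TTS_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}


def check_voice(provider, voice_id):
    """Return a warning string, or None when nothing looks wrong."""
    if voice_id is None:
        return None  # voice_id is optional on the route
    if provider == "openai_tts" and voice_id not in OPENAI_TTS_VOICES:
        return (
            f"'{voice_id}' is not one of the listed OpenAI TTS voices: "
            f"{sorted(OPENAI_TTS_VOICES)}"
        )
    return None
```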

Step 5: Set language

Language code (en, it, es, fr, de, etc.). Affects pronunciation rules + intonation.

Step 6: Skip fine-tuning (voice + language only)

The /generate/tts route takes text, voice_id, language, and provider — it does not accept a fine-grained settings object (stability, similarity, speed). Pick the voice and language that best fit; for finer voice controls use the provider's own tooling or the studio suite.

Step 7: Submit

Click Generate. Returns 202 Accepted + job_id.

Step 8: Track + download

TTS is fast: 5-30 sec typical.

Once completed: audio file (MP3) in your Drive folder. Use directly OR pair with video via compositing.

Endpoint

POST /api/v1/creative-hub/generate/tts (verified).

Body:

  • text (required, 1-5000 chars)
  • provider (optional, elevenlabs | openai_tts)
  • voice_id (optional)
  • language (optional, e.g. en, it)
  • drive_folder_id (optional)
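As an illustration of the field list above, here is a small sketch that builds the request body and applies the documented constraints client-side. The helper name build_tts_body is hypothetical, and the server remains the authority on validation (for example, whether the 5000 limit counts characters or bytes is not stated here).

```python
ALLOWED_PROVIDERS = {"elevenlabs", "openai_tts"}
MAX_TEXT_CHARS = 5000  # documented range for text is 1-5000 chars


def build_tts_body(text, provider=None, voice_id=None, language=None,
                   drive_folder_id=None):
    """Build the JSON body for POST /api/v1/creative-hub/generate/tts."""
    if not 1 <= len(text) <= MAX_TEXT_CHARS:
        raise ValueError(
            f"text must be 1-{MAX_TEXT_CHARS} characters, got {len(text)}"
        )
    if provider is not None and provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"provider must be one of {sorted(ALLOWED_PROVIDERS)}")
    body = {"text": text}
    optional = {
        "provider": provider,
        "voice_id": voice_id,
        "language": language,
        "drive_folder_id": drive_folder_id,
    }
    # Only send optional fields that were actually provided.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body
```

The function has no settings parameter on purpose, which mirrors the route.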

There is no settings field on this route. Returns 202 + job_id. Worker calls the upstream provider, downloads the audio, stores it in Drive, and marks the job completed. Poll GET /api/v1/creative-hub/generate/jobs/:id.
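The submit-then-poll flow described above could look roughly like this. Treat it as a sketch built on assumptions: the base URL, the Bearer token header, the job_id key in the 202 response, the status field on the job, and the "failed" state are all guesses for illustration. Only the two paths, the 202 response, and the completed state come from this article.

```python
import json
import time
import urllib.request


def submit_tts(base_url, token, body):
    """POST the body and return the job_id from the 202 response."""
    request = urllib.request.Request(
        f"{base_url}/api/v1/creative-hub/generate/tts",
        data=json.dumps(body).encode("utf-8"),
        # Assumption: Bearer auth; use whatever your workspace requires.
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        if response.status != 202:
            raise RuntimeError(f"expected 202 Accepted, got {response.status}")
        return json.load(response)["job_id"]  # assumed response key


def get_job(base_url, token, job_id):
    """GET the job record for one job_id."""
    request = urllib.request.Request(
        f"{base_url}/api/v1/creative-hub/generate/jobs/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(fetch_job, job_id, interval=2.0, timeout=120.0, sleep=time.sleep):
    """Poll fetch_job(job_id) until the job reaches a final state."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        # Assumption: the job record has a "status" field.
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout}s")
        sleep(interval)
```

In practice you would call wait_for_job(lambda jid: get_job(base_url, token, jid), job_id). Passing the fetcher in as a parameter keeps the polling loop easy to test without network access.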

Cost

TTS cost is generally low (charged per character or per token). ElevenLabs costs more than OpenAI TTS.

For long-form content (narration of 1 minute or more), cost adds up. Consider OpenAI TTS for iteration and ElevenLabs for the final version.

See ch-112 AI credits.

Multilingual workflow

Same script translated → same voice_id (or per-locale matched voice) → N audio variants for N languages.

Common pattern:

  1. Write English script
  2. Translate via external tool (or human)
  3. Generate TTS in each language with appropriate voice_id + language
  4. Pair each audio with video (often the same video, multiple audio tracks via compositing)

Result: localized ads from a single video asset.
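The pattern above amounts to one request body per language. A minimal sketch follows, assuming you already have the translated scripts and a per-locale voice mapping. The function name and the voice IDs in the usage note are placeholders, not real identifiers.

```python
def build_localized_requests(translations, voices, provider="elevenlabs",
                             drive_folder_id=None):
    """Return one TTS request body per language code.

    translations: {"en": "Hello.", "it": "Ciao."}
    voices:       {"en": "<voice_id>", "it": "<voice_id>"}
    """
    missing = sorted(set(translations) - set(voices))
    if missing:
        raise ValueError(f"no voice_id configured for: {missing}")
    bodies = []
    for language, text in translations.items():
        body = {
            "text": text,
            "provider": provider,
            "voice_id": voices[language],  # locale-matched voice
            "language": language,
        }
        if drive_folder_id is not None:
            body["drive_folder_id"] = drive_folder_id
        bodies.append(body)
    return bodies
```

For example, build_localized_requests({"en": "Hello.", "it": "Ciao."}, {"en": "voice-en", "it": "voice-it"}) yields two bodies, each ready to POST to /api/v1/creative-hub/generate/tts. Failing early on a missing voice avoids generating some languages and silently skipping others.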

Voice cloning (ElevenLabs)

For a custom voice (your spokesperson, brand voice):

  1. Upload voice samples to ElevenLabs UI (their consent + capture flow)
  2. ElevenLabs creates a cloned voice
  3. Reference the new voice_id in Wevion

This lets you scale a spokesperson's voice across hundreds of ad variants without recording each one.

Best practices

Test short before long

Generate 1-2 sentences to validate voice + pronunciation. Then generate full script.

Use punctuation deliberately

Commas = short pause. Periods = full pause. Em-dashes — like this — = thoughtful pause. TTS respects these.

Match voice to ad mood

  • Energetic ads: faster voice with enthusiasm
  • Trust / authority ads: deeper voice with measured pace
  • Casual / friendly: lighter voice, conversational

Pair with compositing immediately

Audio alone has limited use. Pair with video via ch-117 compositing for ad-ready output.

Common mistakes

  • Generating one long monologue: cut it into shorter sections for better pacing + easier iteration
  • Wrong language for voice: an English-trained voice reading Italian produces a strange accent. Use a locale-matched voice.
  • Treating TTS like talking-head video: TTS is audio only; you still need visuals (video or static image carousel)
  • Skipping iteration: the first TTS generation is rarely perfect; iterate on punctuation + pacing

Common issues

  • Mispronunciation of brand name: use phonetic spelling in the input
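  • Tone too monotone: try a more expressive voice with either provider. Style exaggeration is a fine-grained ElevenLabs control, and this route has no settings field, so adjust it in the provider's own tooling or the studio suite.
  • Audio cuts off mid-sentence: the text may have hit the provider's length limit; split it into multiple shorter generations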
  • Languages mixed mid-script: providers handle code-switching poorly; keep single-language scripts
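For the cut-off issue above, splitting at sentence boundaries keeps each generation under a length limit without breaking mid-sentence. This is a sketch, not Wevion code: split_script is a hypothetical helper, and the default of 5000 is the route's documented text limit. A provider's own limit may be lower, so pass a smaller max_chars if audio still cuts off.

```python
import re


def split_script(text, max_chars=5000):
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            raise ValueError("a single sentence exceeds max_chars; shorten it manually")
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk becomes its own TTS request, and the resulting audio files can be placed in sequence during compositing.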

FAQ

Which TTS provider should I use in Wevion?

Wevion's Creative Hub offers two text-to-speech providers. elevenlabs delivers the highest fidelity, excellent multilingual support, voice cloning, and emotional tone control, but costs more and runs slower. openai_tts is faster and cheaper with stable English quality. A common pattern is using OpenAI TTS for iteration and ElevenLabs for the final voiceover.

Can I clone a voice in Wevion?

Yes, through ElevenLabs. Upload voice samples to the ElevenLabs UI using their consent and capture flow, let ElevenLabs create the cloned voice, then reference that new voice_id in Wevion. This lets you scale your spokesperson or brand voice across hundreds of ad variants without recording each line individually.

How do I generate multilingual voiceover?

Translate your script, then generate TTS in each language using an appropriate voice_id and the matching language code. In Wevion, picking a voice native to the target language gives a natural accent. You can then pair each audio track with the same video through compositing, producing localized ads from a single video asset.

How do I make TTS sound natural?

Write TTS-friendly text: keep sentences under about 20 words, use natural punctuation since commas and periods drive pauses, avoid all-caps because some providers shout it, and use phonetic spelling for tricky names. Test one or two sentences first to validate the voice and pronunciation before generating the full script, and iterate on punctuation and pacing.