Text-to-Speech Basics

The words are yours. The voice is AI. The craft is knowing which knobs to turn.

What You'll Learn

How modern TTS engines actually work under the hood
The top TTS platforms and when to use each one
How to write text that sounds natural when spoken
SSML and prosody controls for fine-tuning delivery

Foundation

How TTS Went From Robot to Real

Old TTS was concatenative: it stitched together tiny chunks of recorded speech. It sounded mechanical because it was mechanical. Modern TTS is neural. It learns the patterns of human speech from thousands of hours of recordings and generates new audio from scratch rather than replaying recorded snippets.

The breakthrough models (Tacotron, VITS, VALL-E, and their descendants) don't just read words. They model emphasis, rhythm, and the subtle musicality of natural conversation. The result is speech that many listeners struggle to tell apart from a human recording, especially in short clips.

Tools

The TTS Landscape

ElevenLabs: Widely regarded as a quality leader. Exceptional emotional range, multilingual support, and tightly integrated voice cloning. There is a free tier, and the entry-level paid plan has been priced at about $5/month. Check current pricing before you commit.

OpenAI TTS: Available through the OpenAI API. A small set of preset voices (six at launch), simple to use, great for developers building apps. Less customization than ElevenLabs but rock-solid reliability.

Google Cloud TTS: Enterprise-grade. Hundreds of voices across 40+ languages. WaveNet and Neural2 voices are excellent. Pay-per-character pricing keeps costs predictable.
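
Pay-per-character pricing means you can estimate a bill before you synthesize anything. Here is a minimal sketch of that arithmetic. The function name and the rate are assumptions for illustration, not any provider's real price, and some providers may also count markup characters, so treat the result as a rough figure.

```python
def estimate_cost(text: str, usd_per_million_chars: float) -> float:
    """Return the estimated cost in USD of synthesizing `text` once."""
    return len(text) * usd_per_million_chars / 1_000_000


# Hypothetical rate for illustration only; look up your provider's current price.
EXAMPLE_RATE = 16.0  # USD per one million characters (assumed)

script = "Keep sentences short. Your ears get one shot. " * 1000
cost = estimate_cost(script, EXAMPLE_RATE)
print(f"{len(script):,} characters at ${EXAMPLE_RATE}/1M chars: ${cost:.2f}")
```

Run it on your full script before a big batch job, and again after edits, so a long revision doesn't surprise you.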

Coqui / XTTS: Open-source option. Run it locally, no API costs, full control. Quality is slightly behind the commercial options but improving fast.

Edge TTS: A free option that uses the online speech service behind Microsoft Edge's Read Aloud feature, usually reached through community tools. Surprisingly good quality for zero cost. Great for prototyping.

Craft

Writing for the Ear

Text written for reading and text written for speaking are fundamentally different. Your eyes can re-read a complex sentence. Your ears get one shot. Good TTS input follows these principles:

Keep sentences short. Use contractions — "don't" sounds natural, "do not" sounds formal. Break long thoughts with punctuation. Use ellipses for pauses... and dashes for — emphasis shifts. Spell out numbers and abbreviations when you want consistent pronunciation.
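
You can automate part of this cleanup. The sketch below expands a few abbreviations, spells out small standalone numbers, and flags sentences that run long. The abbreviation table, the 0 to 99 number range, and the 20-word limit are all assumptions chosen for illustration. Tune them to your own material.

```python
import re

# Illustrative expansions only; extend this table for your own scripts.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "approx.": "approximately",
    "e.g.": "for example",
    "vs.": "versus",
}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def normalize_for_speech(text: str) -> str:
    """Expand known abbreviations and spell out standalone 1-2 digit integers."""
    for abbr, spoken in ABBREVIATIONS.items():
        text = text.replace(abbr, spoken)
    # Skip digits that belong to longer numbers, decimals, or ordinals like "1st".
    pattern = r"(?<![\d.,])\b\d{1,2}\b(?![.,]\d)"
    return re.sub(pattern, lambda m: number_to_words(int(m.group())), text)


def long_sentences(text: str, max_words: int = 20) -> list:
    """Return the sentences that contain more than `max_words` words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


print(normalize_for_speech("Dr. Lee ran 12 tests vs. 3 last year."))
# prints: Doctor Lee ran twelve tests versus three last year.
```

Longer numbers, decimals, and years are left untouched on purpose. How they should be read depends on context, so handle those by hand or with SSML.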

SSML (Speech Synthesis Markup Language) gives you fine control. You can adjust rate, pitch, volume, add breaks, and specify pronunciation. Not every platform supports it, but when it's available, it's your best friend for polishing output.

SSML Quick Reference

<break time="500ms"/> — Insert a pause

<emphasis level="strong">word</emphasis> — Add emphasis

<prosody rate="slow">text</prosody> — Adjust speaking speed

<say-as interpret-as="date">2026-03-27</say-as> — Format interpretation
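
If you assemble SSML in code, small helpers keep the markup well-formed. This is a minimal sketch using only the Python standard library. The helper names (pause, emphasize, paced, speak) are made up for this example and are not part of any TTS SDK. It assumes your platform accepts a speak root element, so check its documentation for the exact tags it supports.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def pause(ms: int) -> str:
    """Return a <break> tag for a pause of `ms` milliseconds."""
    return f'<break time="{ms}ms"/>'


def emphasize(text: str, level: str = "strong") -> str:
    """Wrap escaped text in an <emphasis> tag."""
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def paced(text: str, rate: str = "slow") -> str:
    """Wrap escaped text in a <prosody> tag that sets the speaking rate."""
    return f'<prosody rate="{rate}">{escape(text)}</prosody>'


def speak(*parts: str) -> str:
    """Join fragments under a <speak> root and check the XML is well-formed."""
    ssml = "<speak>" + "".join(parts) + "</speak>"
    ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    return ssml


ssml = speak(
    escape("I didn't expect it to work."),
    pause(500),
    paced("But when I pressed play, "),
    emphasize("the game had changed."),
)
print(ssml)
```

Escaping matters because a stray ampersand or angle bracket in your script would otherwise break the markup. The parse step only confirms well-formed XML. It does not confirm that a given engine supports every tag.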

Try It: The Same Text, Three Ways

Take this paragraph and generate it on three different TTS platforms. Compare the results:

I didn't expect it to work. But when I pressed play and heard my own words come back to me in a voice that wasn't mine — a voice that somehow understood where to pause, where to push — I realized the game had changed.

Notice how each platform handles the em-dash, the emotional arc, and the final phrase. These differences define your tool choice for real projects.
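
To keep the comparison fair, send the exact same text to every engine and save the results side by side. The sketch below assumes you wrap each platform in your own function that takes text and returns encoded audio bytes. Those wrappers are hypothetical, and the stand-in engines here only encode the text so the example runs without any API keys.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Hypothetical interface: text in, encoded audio bytes out.
Synthesizer = Callable[[str], bytes]


def render_all(text: str, engines: Dict[str, Synthesizer],
               out_dir: str, ext: str = "mp3") -> Dict[str, Path]:
    """Run every engine on the same text and write one file per engine."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = {}
    for name, synthesize in engines.items():
        path = out / f"{name}.{ext}"
        path.write_bytes(synthesize(text))
        written[name] = path
    return written


SAMPLE = "I didn't expect it to work."

# Stand-in engines for demonstration; replace with real platform wrappers.
demo_engines = {
    "stub_a": lambda text: text.encode("utf-8"),
    "stub_b": lambda text: text.upper().encode("utf-8"),
}

written = render_all(SAMPLE, demo_engines, tempfile.mkdtemp())
for name, path in written.items():
    print(name, "->", path)
```

Naming each file after its engine makes it easy to listen back to back and take notes on pauses, emphasis, and the final phrase.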

Write a Prompt

Write a prompt asking AI to generate a TTS-optimized script. Include SSML-style hints for pauses, emphasis, pacing, and pronunciation guidance for tricky words.
