Text-to-Speech Basics
The words are yours. The voice is AI. The craft is knowing which knobs to turn.
What You'll Learn
- How modern TTS engines actually work under the hood
- The top TTS platforms and when to use each one
- How to write text that sounds natural when spoken
- SSML and prosody controls for fine-tuning delivery
How TTS Went From Robot to Real
Old TTS was concatenative: it stitched together tiny chunks of recorded speech. It sounded mechanical because it was mechanical. Modern TTS is neural. It learns the patterns of human speech from thousands of hours of recordings and generates new audio from scratch instead of replaying stored clips.
The breakthrough models (Tacotron, VITS, VALL-E, and their descendants) don't just read words. They model emphasis, rhythm, and the subtle musicality of natural conversation. The result is speech that can fool most listeners most of the time.
The TTS Landscape
ElevenLabs: The current quality leader. Exceptional emotional range, multilingual support, and the best voice cloning integration. Free tier is generous. Paid plans start at $5/month.
OpenAI TTS: Built into the API. A small set of preset voices (six at launch), simple to use, great for developers building apps. Less customization than ElevenLabs but rock-solid reliability.
Google Cloud TTS: Enterprise-grade. Hundreds of voices across 40+ languages. WaveNet and Neural2 voices are excellent. Pay-per-character pricing keeps costs predictable.
Coqui / XTTS: Open-source option. Run it locally, no API costs, full control. Quality is slightly behind the commercial options but improving fast.
Edge TTS: Microsoft's free option via the Edge browser engine. Surprisingly good quality for zero cost. Great for prototyping.
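To see why pay-per-character pricing is predictable, here is a minimal sketch of the arithmetic. The function name and the rate used in the example are placeholders, not any provider's actual price; check the current pricing page for the voice type you plan to use.

```python
def estimate_cost(text: str, usd_per_million_chars: float) -> float:
    """Estimate the synthesis cost of a script under per-character billing.

    usd_per_million_chars is an assumed, caller-supplied rate.
    """
    return len(text) * usd_per_million_chars / 1_000_000


# Example: a 500,000-character script at a placeholder rate of $16 per million.
script = "a" * 500_000
print(f"${estimate_cost(script, 16.0):.2f}")
```

Because the cost depends only on character count and the rate, you can budget a project before generating any audio.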
Writing for the Ear
Text written for reading and text written for speaking are fundamentally different. Your eyes can re-read a complex sentence. Your ears get one shot. Good TTS input follows these principles:
Keep sentences short. Use contractions — "don't" sounds natural, "do not" sounds formal. Break long thoughts with punctuation. Use ellipses for pauses... and dashes for — emphasis shifts. Spell out numbers and abbreviations when you want consistent pronunciation.
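Some of these principles can be applied automatically before text reaches the TTS engine. The sketch below is one possible pre-processing pass, not a standard tool: the lookup tables and the function names `spell_number` and `prepare_for_speech` are illustrative assumptions, and the matching is deliberately naive.

```python
import re

# Hypothetical lookup tables; extend them for your own material.
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately", "e.g.": "for example"}
CONTRACTIONS = {"do not": "don't", "it is": "it's", "cannot": "can't"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_number(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def prepare_for_speech(text: str) -> str:
    """Expand abbreviations, apply contractions, and spell out small numbers."""
    # Plain substring replacement: simple, but it will also match inside longer tokens.
    for abbr, spoken in ABBREVIATIONS.items():
        text = text.replace(abbr, spoken)
    # Lowercase phrases only; capitalized forms are left alone in this sketch.
    for phrase, contraction in CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(phrase)}\b", contraction, text)
    # Spell out standalone integers 0-99; years, decimals, and IDs are left as-is.
    return re.sub(
        r"(?<![\w.,])\d{1,2}(?!\w|[.,]\d)",
        lambda m: spell_number(int(m.group())),
        text,
    )


print(prepare_for_speech("Dr. Lee saw 12 patients in 3 rooms."))
# Doctor Lee saw twelve patients in three rooms.
```

Longer numbers, dates, and currencies need context to read correctly, so this pass leaves them for the engine or for SSML.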
SSML (Speech Synthesis Markup Language) gives you fine control. You can adjust rate, pitch, volume, add breaks, and specify pronunciation. Not every platform supports it, but when it's available, it's your best friend for polishing output.
SSML Quick Reference
<break time="500ms"/> — Insert a pause
<emphasis level="strong">word</emphasis> — Add emphasis
<prosody rate="slow">text</prosody> — Adjust speaking speed
<say-as interpret-as="date">2026-03-27</say-as> — Format interpretation
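When SSML is assembled in code, the main risk is malformed markup from unescaped characters such as `&` or `<`. The following sketch shows one way to build the tags above safely using only the Python standard library. The helper names (`plain`, `pause`, `emphasize`, `paced`, `speak`) are illustrative, and individual platforms may support only a subset of these tags.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def plain(text: str) -> str:
    """Escape &, <, and > so ordinary text is safe inside SSML."""
    return escape(text)


def pause(ms: int) -> str:
    """Insert a pause of the given length in milliseconds."""
    return f'<break time="{ms}ms"/>'


def emphasize(text: str, level: str = "strong") -> str:
    """Wrap text in an emphasis tag."""
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


def paced(text: str, rate: str = "slow") -> str:
    """Wrap text in a prosody tag that adjusts speaking speed."""
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


def speak(*parts: str) -> str:
    """Join already-built fragments under the <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    plain("I didn't expect it to work."),
    pause(500),
    paced("But when I pressed play,"),
    emphasize("everything changed."),
)

ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Parsing the result before sending it catches structural mistakes locally; it does not confirm that a given platform accepts every tag or attribute value.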
Try It: The Same Text, Three Ways
Take this paragraph and generate it on three different TTS platforms. Compare the results:
I didn't expect it to work. But when I pressed play and heard my own words come back to me in a voice that wasn't mine — a voice that somehow understood where to pause, where to push — I realized the game had changed.
Notice how each platform handles the em-dash, the emotional arc, and the final phrase. These differences define your tool choice for real projects.