Voice Interfaces

The screen is a bottleneck. Voice removes it.

What You'll Learn

  • Designing voice-first user experiences that actually work
  • Building voice-controlled applications with modern APIs
  • Real-time speech-to-speech AI conversations
  • When voice is the right interface — and when it isn't

The Most Natural Interface

We speak 3-4 times faster than we type. We can speak while our hands are busy. We learned to talk before we learned to read. Voice is the interface humans were built for — screens are the workaround we've been stuck with.

The problem with voice interfaces has always been understanding. Siri, Alexa, and Google Assistant are impressive but brittle — they break the moment you go off-script. LLMs changed that equation. An AI that actually understands language makes voice interfaces that actually work.

How Voice Apps Work

A modern voice interface has three layers that chain together in real time:

Ears (STT): Convert the user's speech to text. Whisper, Deepgram, or the Web Speech API. Latency matters here — users expect sub-second response. Deepgram's streaming API is the speed champion.

Brain (LLM): Process the text, understand intent, generate a response. Claude, GPT-4, or a local model. The brain decides what to say and can trigger actions — look up data, control devices, make API calls.

Mouth (TTS): Convert the response back to speech. ElevenLabs, OpenAI TTS, or Edge TTS. Voice quality and latency both matter. Users will tolerate a smart response that takes a second. They won't tolerate a robotic voice.

Speech-to-Speech (S2S): The newest paradigm skips the text layer entirely. OpenAI's Realtime API and similar models process audio in, audio out. Lower latency, more natural conversation flow, and the AI can use tone and inflection as input signals.
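
To make S2S concrete, here is a minimal sketch of one conversational turn against OpenAI's Realtime API over a raw websocket. The endpoint, model name, and event types follow the beta documentation and may have changed; treat them as assumptions and check the current API reference before building on this.

import base64
import json
import os

import websockets  # pip install websockets

async def realtime_turn(pcm16_audio: bytes) -> bytes:
    """One audio-in, audio-out turn: no text layer in the middle."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # beta header per the docs
    }
    # Older websockets versions call this parameter extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Send raw microphone audio (the beta docs specify 24 kHz 16-bit mono PCM).
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))
        # Collect the reply as audio chunks; a real app would play each
        # delta as it arrives for the lowest latency.
        reply = bytearray()
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                reply.extend(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":
                break
        return bytes(reply)

# Usage: audio_out = asyncio.run(realtime_turn(mic_bytes))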

Voice UX Principles

Be brief. Screen text can be scanned. Voice responses are linear — the user has to listen to every word. Keep responses under 30 seconds. If it's longer, offer to go deeper.

Confirm, don't assume. "I'll order the large pizza with mushrooms. Sound right?" Voice misinterpretation has real consequences. Build confirmation into critical actions.

Handle silence. Users pause to think. Good voice interfaces wait. Bad ones say "I didn't catch that" after two seconds of silence and destroy the conversation flow.

Provide escape hatches. "You can say 'start over' at any time." Voice-only interfaces can feel like a trap if users don't know how to navigate. Always offer a way out.

Building a Voice Assistant in Python

Here is a complete voice loop in Python that listens, thinks, and speaks — the same architecture powering commercial voice assistants:

import asyncio
import subprocess

import edge_tts
import speech_recognition as sr
from openai import OpenAI

client = OpenAI()
recognizer = sr.Recognizer()

async def speak(text):
    """Convert text to speech using Edge TTS (free)."""
    communicate = edge_tts.Communicate(text, "en-US-JennyNeural")
    await communicate.save("response.mp3")
    subprocess.run(["afplay", "response.mp3"])  # macOS playback

def listen():
    """Capture speech from the microphone and convert it to text."""
    with sr.Microphone() as source:
        print("Listening...")
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        try:
            audio = recognizer.listen(source, timeout=10)
            text = recognizer.recognize_google(audio)  # Free Google STT
            print(f"You said: {text}")
            return text
        except (sr.WaitTimeoutError, sr.UnknownValueError):
            return None  # silence or unintelligible speech: just listen again

def think(user_input, conversation_history):
    """Process input with an LLM and generate a response."""
    conversation_history.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "system",
            "content": "You are a helpful voice assistant. Keep responses "
                       "under 3 sentences. Be conversational and warm."
        }] + conversation_history
    )
    reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": reply})
    return reply

# Main voice loop
history = []
print("Voice assistant ready. Say 'goodbye' to exit.")
while True:
    user_text = listen()
    if user_text is None:
        continue
    if "goodbye" in user_text.lower():
        asyncio.run(speak("Goodbye! It was nice talking with you."))
        break
    response = think(user_text, history)
    asyncio.run(speak(response))

This runs on any Mac or Linux machine with a microphone. The architecture is identical to commercial assistants: ears (speech_recognition) capture audio, brain (GPT-4) processes it, mouth (Edge TTS) speaks the response. Swap components to upgrade — Deepgram for faster STT, ElevenLabs for better TTS, Claude for a different thinking style.
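
The "confirm, don't assume" principle from the UX section bolts onto this loop with a few lines. Here is a minimal sketch reusing the speak() and listen() helpers above; order_pizza() and the exact phrasing are hypothetical stand-ins for your own critical action.

def order_pizza():
    print("(placeholder: call the real ordering API here)")  # hypothetical

def confirm_then_run(description, action):
    """Gate a critical action behind an explicit spoken confirmation."""
    asyncio.run(speak(f"I'll {description}. Sound right?"))
    answer = listen()
    if answer is None:  # silence: give the user time instead of failing
        answer = listen()
    if answer and "yes" in answer.lower():
        action()
        asyncio.run(speak("Done."))
    else:
        asyncio.run(speak("Okay, I won't do that. You can start over at any time."))

# Usage: confirm_then_run("order the large pizza with mushrooms", order_pizza)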

Latency: The Make-or-Break Metric

Voice interfaces live or die on latency. In a normal conversation, the gap between one person finishing and another responding is about 200-400 milliseconds. If your voice app takes longer than 1.5 seconds to respond, users perceive it as broken. Here is where latency hides and how to crush it:

STT latency: Batch transcription (send audio, wait for full text) adds 1-3 seconds. Streaming transcription (send audio in real-time, get text as it arrives) reduces this to 100-300ms. Deepgram streaming is the fastest option. The Web Speech API in browsers is surprisingly good for prototypes.

LLM latency: The brain is usually the bottleneck. GPT-4 takes 2-5 seconds for a response. GPT-3.5-turbo or Claude Haiku respond in 0.5-1 second. Use streaming responses — start speaking the first sentence while the LLM is still generating the rest. This alone cuts perceived latency by 50%.
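
Here is a sketch of that sentence-level streaming, reusing the client and speak() helper from the assistant above. The regex sentence split is a deliberate simplification, and playback still blocks the loop; a production pipeline would run TTS on a separate task.

import re

def think_streaming(user_input, conversation_history):
    """Speak each sentence as soon as it is complete, while the model is
    still generating the rest of the reply."""
    conversation_history.append({"role": "user", "content": user_input})
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=conversation_history,
        stream=True,  # tokens arrive as they are generated
    )
    buffer, full_reply = "", ""
    for chunk in stream:
        token = chunk.choices[0].delta.content or ""
        buffer += token
        full_reply += token
        # Flush every completed sentence to TTS immediately.
        while (match := re.search(r"(.+?[.!?])\s+", buffer)):
            asyncio.run(speak(match.group(1)))
            buffer = buffer[match.end():]
    if buffer.strip():
        asyncio.run(speak(buffer.strip()))
    conversation_history.append({"role": "assistant", "content": full_reply})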

TTS latency: Cloud TTS adds 0.5-2 seconds for generation plus network round-trip. Edge TTS is faster than ElevenLabs for real-time applications. For the lowest latency, use browser-native speechSynthesis (low quality but instant) or cache common responses as pre-generated audio files.
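
Caching is the cheapest of those wins. Here is a minimal sketch that generates audio once per phrase and replays it afterward, using the same Edge TTS voice and afplay playback as the assistant above; the cache directory and naming scheme are made up for illustration.

import hashlib
import os
import subprocess

import edge_tts

CACHE_DIR = "tts_cache"  # hypothetical cache location

async def speak_cached(text):
    """Replay pre-generated audio for repeated phrases; generate once otherwise."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha1(text.encode()).hexdigest() + ".mp3")
    if not os.path.exists(path):
        communicate = edge_tts.Communicate(text, "en-US-JennyNeural")
        await communicate.save(path)
    subprocess.run(["afplay", path])  # macOS playback, as in the main loop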

The streaming trick: The best voice apps stream everything. STT streams partial transcripts to the LLM. The LLM streams tokens to TTS. TTS streams audio chunks to the speaker. Nothing waits for anything else to finish. This pipelining approach gets total response time under 1 second even with cloud services.
