Transcription & Analysis
Speech-to-text is solved. The real power is what you do with the words after.
What You'll Learn
- State-of-the-art transcription tools and when to use each
- Speaker diarization — who said what and when
- Extracting insights from conversations at scale
- Building searchable audio archives and knowledge bases
Transcription Is a Solved Problem
OpenAI's Whisper changed everything. Released as open source in 2022, it brought near-human accuracy on English audio and support for 99 languages. Suddenly, transcription that used to cost dollars per minute became essentially free. Every tool in this space either uses Whisper directly or competes with it.
Accuracy on clean audio in major languages is 95-99%. The remaining challenges are accents, overlapping speakers, domain-specific jargon, and noisy environments. Knowing which tool handles which edge case is the real skill.
The Transcription Stack
Whisper (local): Free. Run it on your machine. No data leaves your computer. Best for privacy-sensitive content. Slower than cloud options but you control everything.
Deepgram: Fastest cloud transcription. Real-time streaming support. Excellent speaker diarization. Their Nova-2 model rivals Whisper's accuracy at 10x the speed. Pay-per-minute pricing.
AssemblyAI: Best for analysis features beyond raw transcription. Sentiment analysis, topic detection, content moderation, PII redaction — all built in. Their Universal model handles challenging audio well.
Descript: Transcription plus editing in one interface. Edit audio by editing text. Remove filler words with a click. Best for content creators who need transcripts and polished audio simultaneously.
Audio Intelligence
Transcription is step one. The real value comes from what you extract:
Speaker Diarization: Identifying who spoke when. Critical for meetings, interviews, and multi-person recordings. Deepgram and AssemblyAI both handle this well. The output tags each segment with a speaker label.
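To make that concrete, here is a minimal sketch against Deepgram's prerecorded REST endpoint with diarization enabled. The file name, the API-key environment variable, and the turn-grouping loop at the end are illustrative assumptions; the endpoint, query parameters, and response shape follow Deepgram's public API documentation.

import os
import requests

# Send a local file to Deepgram's prerecorded endpoint with diarization on.
# The file name and DEEPGRAM_API_KEY variable are placeholders for this sketch.
with open("meeting.mp3", "rb") as f:
    response = requests.post(
        "https://api.deepgram.com/v1/listen?model=nova-2&diarize=true&punctuate=true",
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "audio/mpeg",
        },
        data=f,
    )
response.raise_for_status()

# With diarize=true, every word carries a speaker index. Group consecutive
# words by speaker to rebuild "who said what" turns.
words = response.json()["results"]["channels"][0]["alternatives"][0]["words"]
turns = []
for w in words:
    if not turns or w["speaker"] != turns[-1]["speaker"]:
        turns.append({"speaker": w["speaker"], "text": w["word"]})
    else:
        turns[-1]["text"] += " " + w["word"]
for t in turns:
    print(f"Speaker {t['speaker']}: {t['text']}")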
Sentiment Analysis: Detecting emotional tone throughout a conversation. When did the mood shift? Where did frustration appear? Invaluable for customer call analysis, therapy research, and UX interviews.
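A minimal sketch with AssemblyAI's Python SDK, which exposes sentiment analysis as a transcription option; the API key and file name below are placeholders:

import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"  # placeholder

# Request sentiment analysis alongside the transcript.
config = aai.TranscriptionConfig(sentiment_analysis=True)
transcript = aai.Transcriber().transcribe("support_call.mp3", config)

# Each result covers one spoken sentence, labeled positive, neutral, or
# negative, with millisecond timestamps: enough to spot where the mood shifts.
for result in transcript.sentiment_analysis:
    print(f"[{result.start / 1000:.1f}s] {result.sentiment}: {result.text}")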
Topic Extraction: Automatically identifying what was discussed and when. Turn a two-hour meeting into a structured summary with action items. Feed the transcript to Claude for deeper analysis — "What decisions were made? What questions went unanswered?"
Searchable Archives: Transcribe your entire audio library. Index it. Now you can search across hundreds of hours of recordings by keyword. Your meeting notes, podcast episodes, voice memos — all searchable in seconds.
Building a Transcription Pipeline in Python
Here is a complete pipeline that transcribes audio, extracts timestamped segments, and generates a structured summary, all automated:
import whisper
from openai import OpenAI

# Step 1: Transcribe with Whisper (local, free, private)
model = whisper.load_model("base")  # Options: tiny, base, small, medium, large
result = model.transcribe("meeting_recording.mp3")
transcript = result["text"]

# Step 2: Get timestamped segments
segments = result["segments"]
for seg in segments[:5]:  # Preview first 5 segments
    start = f"{seg['start']:.1f}s"
    end = f"{seg['end']:.1f}s"
    print(f"[{start} - {end}] {seg['text']}")

# Step 3: Analyze with Claude/GPT
client = OpenAI()
analysis = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"""Analyze this meeting transcript:

{transcript}

Provide:
1. Executive summary (3 sentences)
2. Key decisions made
3. Action items with owners (if identifiable)
4. Unresolved questions
5. Suggested follow-up topics""",
    }],
)
print(analysis.choices[0].message.content)

# Step 4: Save structured output
with open("meeting_analysis.md", "w") as f:
    f.write("# Meeting Analysis\n\n")
    f.write(f"## Transcript\n{transcript}\n\n")
    f.write(f"## Analysis\n{analysis.choices[0].message.content}")

This pipeline runs entirely on your local machine (Whisper) plus one API call for analysis. For a 30-minute meeting, the Whisper base model takes about 2-3 minutes on a modern laptop. The large model takes longer but handles accents and technical jargon significantly better.
Choosing the Right Transcription Tool
Each transcription tool has a specific sweet spot. The wrong choice wastes time or money:
Whisper (local) — Best when: privacy matters, budget is zero, you have time. Speed: 1-5x slower than real-time depending on model size. Accuracy: 95-99% on clean audio in major languages. Run the "large-v3" model for best accuracy or "tiny" for fast drafts. No internet required.
Deepgram Nova-2 — Best when: speed is critical, real-time streaming needed, production applications. Speed: faster than real-time. Accuracy: matches Whisper large model. Cost: $0.0043/minute. Unique: WebSocket streaming API, custom vocabulary for domain-specific terms.
AssemblyAI Universal — Best when: you need analysis beyond raw transcription. Speed: near real-time. Accuracy: competitive with Whisper large. Cost: $0.00025/second ($0.015/minute). Unique: built-in sentiment analysis, topic detection, PII redaction, content moderation — all in one API call.
Descript — Best when: you are editing audio/video content. Speed: fast. Accuracy: excellent. Cost: $24/month for Creator plan. Unique: transcript and audio are linked — edit text to edit audio. Not an API — it is an editing application.
Decision framework: If privacy matters most, use Whisper locally. If speed matters most, use Deepgram. If you need analysis on top of transcription, use AssemblyAI. If you are editing content, use Descript. If budget is zero and you need decent quality, use Whisper tiny or base model.
Building Searchable Audio Knowledge Bases
The real power of transcription is not individual files — it is what happens when you transcribe everything and make it searchable. Here is how to build an audio knowledge base:
Batch transcription: Write a script that watches a folder for new audio files and automatically transcribes them. Whisper handles this well locally. Deepgram's batch API handles it at scale in the cloud. Either way, every recording in your archive becomes searchable text.
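A minimal local watcher, as a sketch: it polls a folder and writes a .txt file next to each new recording. The folder name, the .mp3 filter, and the 30-second poll interval are arbitrary choices; it assumes the openai-whisper package is installed.

import time
from pathlib import Path

import whisper

AUDIO_DIR = Path("recordings")  # watched folder (an assumption for this sketch)
model = whisper.load_model("base")

# Poll the folder and transcribe any audio file that has no .txt sibling yet.
while True:
    for audio in AUDIO_DIR.glob("*.mp3"):
        out = audio.with_suffix(".txt")
        if out.exists():
            continue
        result = model.transcribe(str(audio))
        out.write_text(result["text"])
        print(f"Transcribed {audio.name}")
    time.sleep(30)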
Semantic indexing: Raw keyword search misses context. Use embeddings (OpenAI, HuggingFace BGE-small, or Cohere) to convert each transcript segment into a vector. Store these in a vector database (Supabase pgvector, Pinecone, or Chroma). Now you can search by meaning — "discussions about pricing strategy" finds relevant segments even if the word "pricing" was never spoken.
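Here is a sketch of that idea using OpenAI's embeddings API, with a plain in-memory cosine search standing in for a real vector database; the sample segments are invented for illustration:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # text-embedding-3-small is OpenAI's small embedding model.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# In practice these segments would come from Whisper's result["segments"].
segments = [
    "We should revisit the enterprise tier before Q3.",
    "Marketing wants the launch video by Friday.",
    "Let's keep the free plan but cap it at three projects.",
]
index = embed(segments)  # one vector per segment

# Search by meaning: cosine similarity between the query and each segment.
query = embed(["discussions about pricing strategy"])[0]
scores = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:.2f}  {segments[i]}")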
RAG over audio: Combine your searchable archive with an LLM. Ask questions like "What did we decide about the Q3 launch timeline across all meetings in March?" The system retrieves relevant transcript segments and synthesizes an answer. This turns hours of recordings into an answerable knowledge base.
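Continuing the indexing sketch above (this reuses client, segments, and scores from that example; the top-2 cutoff and the question are arbitrary):

# Retrieve the best-matching segments, then hand only those to the LLM.
top = np.argsort(scores)[::-1][:2]
context = "\n".join(segments[i] for i in top)

answer = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Using only these meeting excerpts:\n{context}\n\n"
                   "What did we decide about pricing?",
    }],
)
print(answer.choices[0].message.content)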
Practical applications: Legal firms search depositions. Journalists search interview archives. Product teams search user research recordings. Sales teams search call recordings for objection patterns. Medical researchers search patient interviews. The use cases are everywhere once the infrastructure exists.