Architecture Decisions
Choose boring infrastructure. Save your creativity for the product.
The models, APIs, and databases you pick on day one will either accelerate you or haunt you for years. Choose wisely.
What you'll learn
- How to choose between APIs, open-source models, and fine-tuning
- The cost/quality/speed triangle for AI infrastructure
- When to use RAG vs. fine-tuning vs. prompt engineering
- Building for model-agnosticism from day one
API vs. Open Source vs. Fine-Tuned
APIs (Claude, GPT, Gemini): Start here. Fastest time to market. Highest quality for general tasks. You pay per token but you ship in days, not months. The tradeoff: you're dependent on someone else's model, pricing, and uptime.
Open source (Llama, Mistral): Lower per-query cost at scale. Full control. But you own the infrastructure — hosting, scaling, monitoring. Don't go here until you have product-market fit and predictable traffic.
Fine-tuned models: Only when you have domain-specific data that general models can't match. Fine-tuning is expensive, requires clean data, and locks you to a specific model version. It's a phase 2 optimization, never a phase 1 choice.
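To make the API-versus-self-hosting tradeoff concrete, here is a back-of-envelope break-even calculation. Every number in it (per-token price, GPU hosting cost) is an illustrative assumption, not a quote from any provider:

```typescript
// Back-of-envelope break-even: API pricing vs. a self-hosted GPU box.
// Both prices below are illustrative assumptions, not real quotes.
const API_COST_PER_M_TOKENS = 5; // USD per million tokens, blended in/out (assumed)
const GPU_HOSTING_PER_MONTH = 1_500; // USD per month for one inference server (assumed)

function monthlyApiCost(tokensPerMonth: number): number {
  return (tokensPerMonth / 1_000_000) * API_COST_PER_M_TOKENS;
}

// Traffic level at which self-hosting starts to pay for itself.
const breakEvenTokens = (GPU_HOSTING_PER_MONTH / API_COST_PER_M_TOKENS) * 1_000_000;

console.log(monthlyApiCost(50_000_000)); // $250/month: stay on the API
console.log(breakEvenTokens); // 300,000,000 tokens/month at these assumptions
```

At these assumed prices you need hundreds of millions of tokens a month before self-hosting is even worth discussing, which is why predictable traffic comes first.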
The Cost/Quality/Speed Triangle
| Approach | Quality | Speed | Cost |
| --- | --- | --- | --- |
| API (Claude/GPT) | High | High | Higher cost per query |
| Open source (Llama) | Good | Moderate | Low at scale (but high infra cost) |
| Fine-tuned | Best for your domain | Slow to set up | Medium ongoing cost |
| Embeddings + RAG | Good with your data | Fast queries | Lowest |
RAG vs. Fine-Tuning vs. Prompting
Prompt engineering is your first tool. A well-crafted system prompt with examples can handle 80% of use cases. It's free to iterate, instant to deploy, and easy to debug. Exhaust this before moving on.
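As a minimal sketch of what "a well-crafted system prompt with examples" looks like in code, here is a call using the Anthropic SDK. The product, prompt content, and model id are placeholders; swap in whatever your provider currently offers:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// A system prompt with inline few-shot examples. The product and answers
// are invented placeholders; iterate here before reaching for RAG.
const SYSTEM_PROMPT = `You are the support assistant for Acme Invoicing.
Answer in two sentences or fewer. Follow the style of these examples:
Q: How do I void an invoice?
A: Open the invoice and choose Void from the actions menu. Voided invoices remain visible for auditing.
Q: Can I change plans mid-cycle?
A: Yes. Upgrades apply immediately and are prorated; downgrades take effect next cycle.`;

const response = await client.messages.create({
  model: "claude-sonnet-4-5", // placeholder id; use a current model
  max_tokens: 300,
  system: SYSTEM_PROMPT,
  messages: [{ role: "user", content: "How do refunds work?" }],
});

console.log(response.content);
```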
RAG (Retrieval-Augmented Generation) is for when the model needs your data — product docs, knowledge bases, user history. Store your data as embeddings, retrieve relevant chunks at query time, and feed them to the model as context. This is the sweet spot for most products.
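A minimal sketch of that loop, assuming a Postgres table named `doc_chunks` with a pgvector `embedding` column, and hypothetical `embed()`, `sql()`, and `generate()` helpers wrapping your embedding API, database client, and chat model:

```typescript
// Hypothetical helpers: embed() wraps your embedding API, sql() your
// Postgres client, generate() your chat model call.
import { embed, sql, generate } from "./ai";

async function answerWithRag(question: string): Promise<string> {
  // 1. Embed the user's question.
  const queryVector = await embed(question);

  // 2. Retrieve the nearest chunks from a pgvector table (schema assumed).
  //    `<=>` is pgvector's cosine-distance operator.
  const chunks = await sql<{ content: string }>(
    `SELECT content FROM doc_chunks ORDER BY embedding <=> $1 LIMIT 5`,
    [JSON.stringify(queryVector)]
  );

  // 3. Feed the retrieved chunks to the model as context.
  const context = chunks.map((c) => c.content).join("\n---\n");
  return generate(
    `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`
  );
}
```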
Fine-tuning is for when you need the model to behave differently at a fundamental level — a specific tone, a specialized vocabulary, a unique reasoning pattern. It's powerful but expensive and hard to iterate on.
Build Model-Agnostic From Day One
Never hard-wire your product to a single AI provider. Abstract your model calls behind a clean interface. Today you use Claude. Tomorrow GPT-5 drops and it's better for your use case. Next month an open-source model matches quality at a tenth of the cost.
Your architecture should let you swap models with a config change, not a rewrite. Store prompts as templates. Keep model-specific code in a thin adapter layer. Your business logic should never know or care which model generated the response.
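One way to structure that thin adapter layer in TypeScript. This is a sketch: the interface, class names, and config variable are illustrative, and the provider SDK calls are elided:

```typescript
// A thin, provider-agnostic interface. Business logic depends only on this;
// it never knows which model produced the response.
interface ChatModel {
  complete(system: string, user: string): Promise<string>;
}

// One small adapter per provider (SDK calls elided in this sketch).
class AnthropicAdapter implements ChatModel {
  async complete(system: string, user: string): Promise<string> {
    /* call the Anthropic SDK here */
    return "";
  }
}

class OpenAIAdapter implements ChatModel {
  async complete(system: string, user: string): Promise<string> {
    /* call the OpenAI SDK here */
    return "";
  }
}

// Swapping models becomes a config change, not a rewrite.
function modelFromConfig(provider: string): ChatModel {
  switch (provider) {
    case "anthropic":
      return new AnthropicAdapter();
    case "openai":
      return new OpenAIAdapter();
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}

const model = modelFromConfig(process.env.MODEL_PROVIDER ?? "anthropic");
```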
The Starter Stack
For most AI products on day one: a frontend (Next.js, Svelte, or even a static site), a backend that handles auth and billing (Supabase or Firebase, plus Stripe for payments), an AI API (Claude or GPT), and a vector database for RAG (pgvector, Pinecone). That's it. Four components. Ship fast, optimize later.
Embedding Strategies for RAG
RAG is the most common architecture for AI products that need domain knowledge. But "just add RAG" hides a dozen decisions that determine whether your retrieval actually works.
Chunk size matters. Too small (50 tokens) and you lose context. Too large (2,000 tokens) and you dilute relevance. Start with 300-500-token chunks and a 50-token overlap between consecutive chunks, then test on 20 real queries and adjust. There is no universal optimal size; it depends entirely on your data.
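A naive chunker along those lines. The token estimate is a rough heuristic (one token is roughly 0.75 English words; real tokenizers vary), and a production version would also respect sentence and paragraph boundaries:

```typescript
// Rough heuristic: one token ≈ 0.75 English words (assumption; tokenizers vary).
const WORDS_PER_TOKEN = 0.75;

// Split text into ~chunkTokens-sized chunks with overlapTokens of overlap
// between consecutive chunks. Splits on whitespace for simplicity.
function chunkText(text: string, chunkTokens = 400, overlapTokens = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunkWords = Math.floor(chunkTokens * WORDS_PER_TOKEN);
  const overlapWords = Math.floor(overlapTokens * WORDS_PER_TOKEN);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += chunkWords - overlapWords) {
    chunks.push(words.slice(i, i + chunkWords).join(" "));
  }
  return chunks;
}
```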
Embedding model selection. OpenAI's text-embedding-3-small is cheap and good. Cohere's embed-v3 handles multilingual well. For free, open-source options, BGE-small runs on CPU and produces surprisingly good results. Don't overthink this choice on day one — embedding models are easily swappable.
Hybrid search. Pure vector search misses exact matches. Pure keyword search misses semantic meaning. The best RAG systems combine both — use pgvector for semantic similarity and full-text search for exact keyword matches, then merge and re-rank results. This catches queries that either approach alone would miss.
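A sketch of that hybrid flow against Postgres, reusing the hypothetical `embed()` and `sql()` helpers from above and assuming `doc_chunks` has both an `embedding` vector column and a `tsv` tsvector column. Results are merged with reciprocal rank fusion, one common re-ranking choice:

```typescript
import { embed, sql } from "./ai"; // same hypothetical helpers as above

// Reciprocal rank fusion: a result scores well if it ranks high in either list.
function rrfMerge(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

async function hybridSearch(query: string): Promise<string[]> {
  const vec = JSON.stringify(await embed(query));

  // Semantic candidates: pgvector cosine distance.
  const semantic = await sql<{ id: string }>(
    `SELECT id FROM doc_chunks ORDER BY embedding <=> $1 LIMIT 20`,
    [vec]
  );

  // Exact-match candidates: Postgres full-text search on a tsvector column.
  const keyword = await sql<{ id: string }>(
    `SELECT id FROM doc_chunks
     WHERE tsv @@ plainto_tsquery('english', $1)
     ORDER BY ts_rank(tsv, plainto_tsquery('english', $1)) DESC
     LIMIT 20`,
    [query]
  );

  return rrfMerge([semantic.map((r) => r.id), keyword.map((r) => r.id)]);
}
```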
Metadata filtering. Store metadata alongside embeddings — document type, date, author, category. At query time, filter by metadata before doing vector similarity. A user asking about "Q3 revenue" shouldn't get results from Q1 documents, even if they're semantically similar.
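Continuing the same assumed schema and helpers, the filter goes in the `WHERE` clause so similarity ranking only ever sees eligible documents. The column names (`doc_type`, `quarter`) are invented for illustration:

```typescript
import { embed, sql } from "./ai"; // same hypothetical helpers as above

// Metadata narrows the candidate set first; vector similarity ranks within it.
const results = await sql<{ content: string }>(
  `SELECT content FROM doc_chunks
   WHERE doc_type = $2
     AND quarter = $3
   ORDER BY embedding <=> $1
   LIMIT 5`,
  [JSON.stringify(await embed("What was Q3 revenue?")), "financial_report", "Q3"]
);
```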