Retrieval-augmented generation (RAG) is how you give Claude access to knowledge it was not trained on — your documents, your data, your institutional memory. Instead of cramming everything into the context window and hoping for the best, RAG retrieves the most relevant pieces of information for each query and feeds them to Claude alongside the question. The result is answers grounded in your actual data, not Claude's training set.
We run a production RAG system at Like One that powers our sovereign brain — over 7,000 vectors across 11 collections, serving hybrid search queries in under 50 milliseconds. This is not a demo. This is the architecture we built, broke, rebuilt, and now rely on for every session, every search, every decision. The patterns in this guide are extracted from that system.
If you are new to the Claude API, read our API guide first. If you understand API calls and want to give Claude memory, keep reading.
What RAG Actually Does
RAG solves a fundamental limitation: Claude knows what it was trained on, but it does not know your data. Your company's internal documents, your customer records, your product specifications, your legal contracts — none of this exists in Claude's training data. Without RAG, you have two bad options: paste everything into the prompt (expensive, limited by context window) or accept that Claude will hallucinate when asked about your specific information.
RAG introduces a retrieval step before generation. When a user asks a question, you first search your knowledge base for relevant documents, then include those documents in the prompt alongside the question. Claude generates its answer using both its training knowledge and the retrieved context. The architecture looks like this:
- Index: Split your documents into chunks, generate embeddings for each chunk, and store them in a vector database.
- Retrieve: When a query arrives, embed the query, search the vector database for the most similar chunks, and return the top results.
- Generate: Send the query plus the retrieved chunks to Claude as context, and let Claude synthesize an answer from the relevant information.
Each step has tradeoffs that affect answer quality, latency, and cost. The rest of this guide walks through each step with the decisions and code that make them work in production.
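To make the three steps concrete before going into each one, here is a minimal, dependency-free sketch of the whole loop. The `toy_embed` function is a stand-in assumption (a bag-of-words count, not a real embedding model), and all function names are hypothetical; the point is only to show how index, retrieve, and generate fit together.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks):
    # Index: embed every chunk once, up front
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(index, query, top_k=2):
    # Retrieve: embed the query and rank chunks by similarity
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(query, chunks):
    # Generate: this string is what would be sent to Claude
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

index = build_index([
    "Refunds are processed within 14 days of the return request.",
    "The API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
])
top = retrieve(index, "What is the API rate limit?", top_k=1)
prompt = build_prompt("What is the API rate limit?", top)
print(prompt)
```

A real system swaps `toy_embed` for an embedding model and the list scan for a vector database, but the shape of the loop stays the same.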
Step 1: Chunking Your Documents
Chunking is where most RAG systems succeed or fail. The goal is to split documents into pieces that are small enough to be specific but large enough to be self-contained. Bad chunking produces bad retrieval, and no amount of model capability compensates for retrieving the wrong information.
Chunking Strategies
Fixed-size chunking splits documents into chunks of a fixed token or character count with overlap between consecutive chunks. Simple to implement, works surprisingly well for uniform documents like API documentation or knowledge base articles.
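def fixed_size_chunks(text, chunk_size=500, overlap=50):
    # Sizes are counted in whitespace-separated words, a rough proxy for tokens
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunks.append(' '.join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks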
Semantic chunking splits at natural boundaries — paragraphs, sections, headings. This preserves the document's logical structure and produces chunks that are more self-contained. More complex to implement but produces better retrieval for structured documents.
import re

def semantic_chunks(text, max_tokens=500):
    # Split on blank lines (paragraphs) or immediately before a heading line
    sections = re.split(r'\n\n+|(?=^#{1,3}\s)', text, flags=re.MULTILINE)
    chunks = []
    current = []
    current_len = 0
    for section in sections:
        section = section.strip()
        if not section:
            continue  # skip empty pieces produced by the split
        section_len = len(section.split())  # word count as a token proxy
        if current_len + section_len > max_tokens and current:
            chunks.append('\n\n'.join(current))
            current = [section]
            current_len = section_len
        else:
            current.append(section)
            current_len += section_len
    if current:
        chunks.append('\n\n'.join(current))
    return chunks
Recursive chunking tries multiple split strategies in order — first by section, then by paragraph, then by sentence, then by word — until each chunk fits within the size limit. This is the approach used by LangChain's RecursiveCharacterTextSplitter and works well as a general-purpose default.
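A minimal sketch of the recursive idea, assuming word counts as a stand-in for tokens. This is not LangChain's implementation; the function name, the separator list, and the `max_words` parameter are illustrative choices. Note that the separator is dropped wherever a chunk boundary falls.

```python
def recursive_chunks(text, max_words=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones on oversized pieces."""
    if len(text.split()) <= max_words:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard word-count cut
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate.split()) <= max_words:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
        if len(piece.split()) > max_words:
            # This piece alone is too big: retry it with the finer separators
            chunks.extend(recursive_chunks(piece, max_words, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "a b c\n\nd e f g h\n\ni j"
pieces = recursive_chunks(sample, max_words=4)
print(pieces)
```

The second paragraph is too long for the limit, so only that paragraph falls through to the finer separators; the others stay intact.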
How We Chunk
In our production system, we use semantic chunking with a 500-token target and metadata enrichment. Each chunk carries its source document ID, position index, and a condensed summary of the parent document's title and category. This metadata is critical for reranking and attribution — when Claude cites a source, we can trace it back to the exact document and section.
Key rules we learned the hard way:
- Overlap matters. Without overlap, you lose context that spans chunk boundaries. A 50-token overlap (10% of chunk size) catches most boundary-spanning information without excessive redundancy.
- Chunk size depends on your content. Technical documentation works well at 300-500 tokens. Narrative content (meeting notes, emails) needs 500-800 tokens to preserve context. Legal documents need 800-1200 tokens because clauses reference each other across paragraphs.
- Never split code blocks. If a chunk contains a code example, keep the entire block together even if it exceeds your target size. A partial code snippet is worse than a slightly oversized chunk.
- Preserve headings. Prepend the section heading to every chunk from that section. A chunk that says "Use the --force flag" is useless without knowing it is from the "Git Reset" section.
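The last two rules can be sketched together. The example below is a simplified illustration, not our production chunker: it assumes Markdown-style input where headings and paragraphs are separated by blank lines, and the helper names are hypothetical. Fenced code blocks are kept whole, and every chunk is prefixed with the most recent heading.

```python
import re

FENCE_MARK = "`" * 3  # three backticks, built indirectly to keep this example readable
FENCE = re.compile(f"({FENCE_MARK}.*?{FENCE_MARK})", re.DOTALL)

def split_blocks(text):
    """Split into paragraphs on blank lines, keeping fenced code blocks whole."""
    blocks = []
    for i, part in enumerate(FENCE.split(text)):
        if i % 2 == 1:
            blocks.append(part)  # captured group: a fenced code block, left untouched
        else:
            blocks.extend(p.strip() for p in part.split("\n\n") if p.strip())
    return blocks

def chunks_with_headings(text):
    """Return one chunk per block, each prefixed with its section heading."""
    heading, chunks = "", []
    for block in split_blocks(text):
        if re.match(r"#{1,6}\s", block):
            first, _, rest = block.partition("\n")
            heading, block = first, rest.strip()  # remember the current section
            if not block:
                continue
        chunks.append(f"{heading}\n\n{block}" if heading else block)
    return chunks

code = f"{FENCE_MARK}\ngit reset --hard\n\ngit push --force\n{FENCE_MARK}"
doc = f"## Git Reset\n\nUse the --force flag.\n\n{code}\n"
result = chunks_with_headings(doc)
print(result)
```

The code block contains a blank line, which a naive paragraph split would have cut in two; here it survives as a single chunk with its heading attached.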
Step 2: Generating Embeddings
Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, which is what makes vector search work — you find relevant chunks by finding vectors that are close to the query vector in high-dimensional space.
Choosing an Embedding Model
You have several options for embedding models. The choice depends on your requirements for quality, speed, cost, and privacy:
- Voyage AI embeddings: High-quality embeddings from Voyage AI, the provider Anthropic's documentation recommends for use with Claude (Anthropic does not offer its own embedding model). Available via API. Good default if you are already using the Anthropic ecosystem.
- OpenAI's text-embedding-3-small/large: Widely used, well-documented, API-based. Good quality at reasonable cost.
- Local models (mxbai-embed-large, nomic-embed-text): Run on your own hardware via Ollama or similar. Zero API cost, full privacy, sub-50ms latency. We use mxbai-embed-large (1024 dimensions) and get 37ms per query on an M3 Mac.
# Using a local model via Ollama
import requests

def embed(text, model="mxbai-embed-large"):
    response = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": model, "input": text},
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["embeddings"][0]

# Embed a chunk
vector = embed("RAG systems retrieve relevant documents before generating.")
print(f"Dimensions: {len(vector)}")  # 1024 for mxbai-embed-large

# Using Voyage via API
import voyageai

client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
result = client.embed(
    ["RAG systems retrieve relevant documents."],
    model="voyage-3",
    input_type="document"
)
vector = result.embeddings[0]
For most production systems, we recommend starting with a local model. The quality difference between local and API models is small for RAG retrieval, and the latency and cost advantages of local inference are significant at scale. You can always upgrade to an API model later if retrieval quality is insufficient.
Embedding Pipeline
An embedding pipeline processes new and updated documents automatically. In our system, a cron job runs every 5 minutes, checks for new brain entries, generates embeddings, and stores them in SQLite with the sqlite-vec extension. The entire pipeline processes a new entry in under 100ms.
import sqlite3
import json

def embed_and_store(db_path, entries):
    db = sqlite3.connect(db_path)
    # sqlite-vec must be loaded on every connection that touches vec_search
    db.enable_load_extension(True)
    db.load_extension("vec0")
    for entry in entries:
        # Generate embedding
        vector = embed(entry["text"])
        # Store in sqlite-vec (vectors can be passed as JSON text)
        db.execute(
            "INSERT INTO vec_search(rowid, embedding) VALUES (?, ?)",
            (entry["id"], json.dumps(vector))
        )
        # Store metadata
        db.execute(
            "INSERT INTO vec_meta(rowid, key, collection, text) VALUES (?, ?, ?, ?)",
            (entry["id"], entry["key"], entry["collection"], entry["text"])
        )
    db.commit()
    db.close()
The pipeline should be idempotent — running it twice on the same document produces the same result. Check for existing embeddings before generating new ones, and update rather than duplicate when documents change.
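One way to get idempotency is to store a content hash next to each embedding and skip the embedding call when the hash is unchanged. The sketch below is an assumption-laden illustration: it uses a plain SQLite table named `embeddings` (hypothetical, not our production schema) and takes the embedding function as a parameter so it runs without a model. A vec0 virtual table may need a delete followed by an insert rather than an upsert, so check the sqlite-vec documentation before adapting this.

```python
import hashlib
import json
import sqlite3

def upsert_embedding(db, key, text, embed_fn):
    """Embed and store `text` under `key` only if the content has changed.

    Returns True when an embedding was (re)generated, False when skipped.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = db.execute(
        "SELECT content_hash FROM embeddings WHERE key = ?", (key,)
    ).fetchone()
    if row and row[0] == digest:
        return False  # unchanged content: skip the embedding call entirely
    vector = embed_fn(text)
    db.execute(
        """INSERT INTO embeddings(key, content_hash, text, vector)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(key) DO UPDATE SET
               content_hash = excluded.content_hash,
               text = excluded.text,
               vector = excluded.vector""",
        (key, digest, text, json.dumps(vector)),
    )
    db.commit()
    return True

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE embeddings (
    key TEXT PRIMARY KEY, content_hash TEXT, text TEXT, vector TEXT)""")

calls = []
def fake_embed(text):
    # Placeholder for a real embedding call; records how often it is invoked
    calls.append(text)
    return [float(len(text))]

first = upsert_embedding(db, "doc1", "hello", fake_embed)
second = upsert_embedding(db, "doc1", "hello", fake_embed)
third = upsert_embedding(db, "doc1", "hello world", fake_embed)
```

Running the pipeline twice on the same text calls the model once; changing the text updates the existing row instead of adding a duplicate.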
Step 3: Vector Storage
You need somewhere to store embeddings and search them efficiently. The options range from simple files to dedicated vector databases:
- SQLite + sqlite-vec: Our choice. Zero infrastructure, embedded in your application, handles tens of thousands of vectors with sub-50ms search. Perfect for single-machine deployments and local-first applications.
- ChromaDB: Python-native vector database. Easy setup, good for prototyping. We used it early on but migrated to sqlite-vec for reliability and simplicity.
- Pinecone / Weaviate / Qdrant: Managed vector databases for scale. Use these when you have millions of vectors, need multi-tenant isolation, or require distributed search across regions.
- pgvector (PostgreSQL): If you already run PostgreSQL, pgvector adds vector search without a separate database. Good for applications that want to keep everything in one data store.
SQLite-vec Setup
import sqlite3

def setup_vec_db(db_path):
    db = sqlite3.connect(db_path)
    db.enable_load_extension(True)
    db.load_extension("vec0")  # Load sqlite-vec
    # Create virtual table for vector search.
    # float[1024] must match your model's dimensions. vec0 defaults to L2
    # distance, so cosine is requested explicitly to match the cosine
    # distance threshold used at query time.
    db.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS vec_search
        USING vec0(
            embedding float[1024] distance_metric=cosine
        )
    """)
    # Metadata table for chunk content and source tracking
    db.execute("""
        CREATE TABLE IF NOT EXISTS vec_meta (
            rowid INTEGER PRIMARY KEY,
            key TEXT,
            collection TEXT,
            text TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    db.commit()
    return db
SQLite-vec uses exact (brute-force) nearest-neighbor search by default, which is fine for up to ~100,000 vectors. Beyond that, consider a database with approximate nearest-neighbor (ANN) indexing such as Pinecone or Qdrant, which trades a small amount of accuracy for much faster search at scale.
Step 4: Retrieval
Retrieval is the core of RAG. Given a user query, find the most relevant chunks from your knowledge base. The quality of retrieval directly determines the quality of Claude's answers — retrieve the right context and Claude shines; retrieve the wrong context and Claude confidently generates incorrect answers from irrelevant information.
Vector Search (Semantic)
def vector_search(db, query, top_k=5, max_distance=0.75):
    query_vec = embed(query)
    results = db.execute("""
        SELECT m.key, m.text, v.distance
        FROM vec_search v
        JOIN vec_meta m ON m.rowid = v.rowid
        WHERE v.embedding MATCH ?
          AND k = ?
          AND v.distance < ?
        ORDER BY v.distance
    """, (json.dumps(query_vec), top_k, max_distance)).fetchall()
    return [{"key": r[0], "text": r[1], "score": 1 - r[2]} for r in results]
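The max_distance threshold is critical. Without it, vector search always returns results even when nothing is relevant: the query "what is the weather" will return your closest chunk about API configuration, because something is always closest. We use 0.75 as our cosine distance threshold, which eliminates noise while keeping genuinely relevant results. The right value depends on the embedding model and on the distance metric of the index (vec0 defaults to L2 unless the column is declared with cosine), so calibrate it against your own queries.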
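The "something is always closest" problem is easy to reproduce with a brute-force search over hand-made vectors. This is an illustrative sketch only: the two-dimensional vectors, the keys, and the `nearest` helper are made up for the example and stand in for real embeddings and a real index.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nearest(query_vec, items, top_k=5, max_distance=None):
    """Exact (brute-force) search over (key, vector) pairs."""
    scored = sorted((cosine_distance(query_vec, vec), key) for key, vec in items)
    if max_distance is not None:
        scored = [(d, k) for d, k in scored if d < max_distance]
    return [(k, round(d, 3)) for d, k in scored[:top_k]]

items = [("api-config", [1.0, 0.1]), ("billing", [0.9, 0.3])]
on_topic = [1.0, 0.2]    # points in roughly the same direction as both items
off_topic = [-0.2, 1.0]  # points somewhere else entirely

unfiltered = nearest(off_topic, items, top_k=1)
filtered = nearest(off_topic, items, top_k=1, max_distance=0.75)
relevant = nearest(on_topic, items, top_k=2, max_distance=0.75)
print(unfiltered, filtered, relevant)
```

Without the threshold, the off-topic query still gets a "best" match; with it, the result list is empty and the generation step can say it has nothing to work with.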
Keyword Search (BM25)
Vector search finds semantically similar content, but sometimes users search for exact terms. BM25 (the algorithm behind traditional search engines) complements vector search by matching exact keywords. SQLite's FTS5 extension provides BM25 scoring out of the box.
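def keyword_search(db, query, top_k=5):
    # FTS5's bm25() returns lower (more negative) values for better matches,
    # so ascending order puts the best match first
    results = db.execute("""
        SELECT key, text, bm25(content_fts) as score
        FROM content_fts
        WHERE content_fts MATCH ?
        ORDER BY score
        LIMIT ?
    """, (query, top_k)).fetchall()
    # Negate so that a higher score means a better match
    return [{"key": r[0], "text": r[1], "score": -r[2]} for r in results]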
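The keyword search assumes a `content_fts` table already exists and that the query string is valid FTS5 syntax. The sketch below shows one possible setup; the two-column schema and the `to_fts_query` helper are assumptions for illustration, not our production schema. Quoting each term matters because raw user input containing punctuation such as `?` can otherwise be rejected by the FTS5 query parser.

```python
import sqlite3

def setup_fts(db):
    # Hypothetical schema: key is stored but not searchable, text is indexed
    db.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS content_fts
        USING fts5(key UNINDEXED, text)
    """)

def to_fts_query(user_query):
    """Quote each term so punctuation is not parsed as FTS5 query syntax."""
    terms = [t.replace('"', '""') for t in user_query.split()]
    return " OR ".join(f'"{t}"' for t in terms)

db = sqlite3.connect(":memory:")
setup_fts(db)
db.executemany(
    "INSERT INTO content_fts(key, text) VALUES (?, ?)",
    [
        ("errors", "Error code E4012 means the upstream service timed out."),
        ("billing", "Invoices are issued on the first day of each month."),
    ],
)
fts_query = to_fts_query("what is E4012?")
rows = db.execute(
    "SELECT key FROM content_fts WHERE content_fts MATCH ? "
    "ORDER BY bm25(content_fts) LIMIT 5",
    (fts_query,),
).fetchall()
print(fts_query, rows)
```

Joining terms with OR keeps recall high for natural-language questions; BM25 ranking then favours rows that match the rarer terms.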
Hybrid Search (Best of Both)
The best RAG systems combine vector and keyword search. Hybrid search uses Reciprocal Rank Fusion (RRF) to merge results from both approaches, giving you semantic understanding and exact keyword matching in a single query.
def hybrid_search(db, query, top_k=5, vec_weight=0.7, bm25_weight=0.3):
    vec_results = vector_search(db, query, top_k=top_k * 2)
    bm25_results = keyword_search(db, query, top_k=top_k * 2)
    # Weighted Reciprocal Rank Fusion: each list contributes
    # weight / (60 + rank), with ranks starting at 1
    scores = {}
    for rank, r in enumerate(vec_results, start=1):
        key = r["key"]
        scores[key] = scores.get(key, 0) + vec_weight / (rank + 60)
    for rank, r in enumerate(bm25_results, start=1):
        key = r["key"]
        scores[key] = scores.get(key, 0) + bm25_weight / (rank + 60)
    # Sort by combined score and return top_k
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    # Look up the full text for each key (vector results win on duplicates)
    texts = {r["key"]: r["text"] for r in bm25_results + vec_results}
    return [{"key": key, "text": texts[key], "score": score} for key, score in ranked]
Our production system uses hybrid search with a 0.7/0.3 vector/BM25 split. Vector search handles the semantic heavy lifting while BM25 catches exact-match queries that vector search sometimes misses. Average query time: 45ms including embedding generation.
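To see how the fusion arithmetic plays out, here is a small worked example on plain ranked lists, with no database involved. The `rrf_fuse` helper and the document keys are hypothetical; ranks start at 1 and the constant 60 is the conventional RRF smoothing value.

```python
def rrf_fuse(rankings, weights, k=60, top_n=5):
    """Weighted Reciprocal Rank Fusion over several ranked lists of keys."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, key in enumerate(ranking, start=1):
            scores[key] = scores.get(key, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

vector_hits = ["doc-a", "doc-b", "doc-c"]
keyword_hits = ["doc-c", "doc-d"]
fused = rrf_fuse([vector_hits, keyword_hits], weights=[0.7, 0.3])
print(fused)
```

Here doc-c is only third in the vector list (0.7/63) but first in the keyword list (0.3/61). The sum, about 0.0160, beats doc-a's 0.7/61, about 0.0115, so a document found by both methods rises above one found by a single method.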
Reranking
For high-stakes queries where retrieval quality matters more than speed, add a reranking step. After initial retrieval returns the top candidates, a reranking model rescores them for relevance to the specific query. This catches cases where the initial retrieval surfaces the right documents but in the wrong order.
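import re
import requests

def rerank(query, results, model="llama3.2:3b"):
    """Use a small LLM to rerank results by relevance."""
    docs = "\n".join(f'{i + 1}. {r["text"][:200]}' for i, r in enumerate(results))
    prompt = (
        "Rate the relevance of each document to the query.\n"
        f"Query: {query}\n"
        f"Documents:\n{docs}\n"
        "Return a JSON array of document numbers ordered by relevance."
    )
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=60,
    )
    response.raise_for_status()
    # Take the first bracketed list of numbers in the reply, if there is one
    match = re.search(r"\[[\d,\s]*\]", response.json()["response"])
    order = [int(n) for n in re.findall(r"\d+", match.group(0))] if match else []
    picked = [i - 1 for i in dict.fromkeys(order) if 1 <= i <= len(results)]
    # Append anything the model left out so no result is dropped
    picked += [i for i in range(len(results)) if i not in picked]
    return [results[i] for i in picked]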
Reranking adds ~2-3 seconds of latency, so use it selectively. We enable reranking for pre-session context generation (where quality matters and latency is acceptable) and disable it for interactive search (where speed matters more).
Step 5: Generation with Retrieved Context
The final step is sending the retrieved context to Claude alongside the user's question. This is where RAG meets the Claude API.
import anthropic

def rag_query(question, knowledge_base_db):
    # Retrieve relevant chunks
    context = hybrid_search(knowledge_base_db, question, top_k=5)
    # Format context for Claude
    context_text = "\n\n".join([
        f"[Source: {c['key']}]\n{c['text']}"
        for c in context
    ])
    # Generate answer with Claude
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=(
            "You are a helpful assistant that answers questions based on "
            "the provided context. If the context does not contain enough "
            "information to answer the question, say so explicitly. Never "
            "make up information that is not in the context."
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n{context_text}\n\nQuestion: {question}"
        }]
    )
    return {
        "answer": response.content[0].text,
        "sources": [c["key"] for c in context],
        "tokens": response.usage
    }
System Prompt Design for RAG
The system prompt is critical for RAG quality. It tells Claude how to use the retrieved context and what to do when the context is insufficient. Key elements:
- Grounding instruction: Tell Claude to answer from the provided context, not its training data. This prevents hallucination about your specific data.
- Citation instruction: Tell Claude to reference source documents. This lets users verify answers and builds trust in the system.
- Insufficiency instruction: Tell Claude what to do when the context does not contain the answer. "I don't have enough information" is better than a fabricated answer.
- Format instruction: Specify the output format — bullet points, paragraphs, JSON — to get consistent, parseable responses.
Our production system prompt for RAG queries:
system = """Answer questions using ONLY the provided context documents.
Rules:
- Base your answer entirely on the context provided
- If the context does not contain enough information, say "I don't have
enough information in my knowledge base to answer this fully"
- Cite sources using [Source: key] format
- Be concise — answer the question directly, then provide supporting detail
- Never invent statistics, dates, or facts not present in the context"""
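Because the prompt asks for a fixed `[Source: key]` format, the citations in an answer can be checked mechanically against the keys that were actually retrieved. The following is a small sketch of that check; the `check_citations` helper and the sample keys are hypothetical, and the regular expression assumes keys never contain a closing bracket.

```python
import re

SOURCE_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")

def check_citations(answer, retrieved_keys):
    """Split cited keys into those that were retrieved and those that were not."""
    cited = {m.strip() for m in SOURCE_PATTERN.findall(answer)}
    known = set(retrieved_keys)
    return {"valid": sorted(cited & known), "unknown": sorted(cited - known)}

report = check_citations(
    "Refunds take 14 days [Source: refund-policy]. "
    "Shipping is free [Source: shipping-faq].",
    retrieved_keys=["refund-policy", "pricing"],
)
print(report)
```

A non-empty `unknown` list is a useful signal to log: it means the answer cited something that was never in the context.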
Production Patterns
Contextual Compression
When retrieved chunks are long, compress them before sending to Claude. Extract only the sentences relevant to the query. This reduces token usage (cost) and focuses Claude's attention on the most relevant information.
def compress_context(query, chunks, model="claude-haiku-4-5"):
    """Use Haiku to extract relevant sentences from each chunk."""
    client = anthropic.Anthropic()
    compressed = []
    for chunk in chunks:
        response = client.messages.create(
            model=model,
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": (
                    "Extract only the sentences relevant to this query "
                    "from the document. Return just the relevant text.\n"
                    f"Query: {query}\n"
                    f"Document: {chunk['text']}"
                )
            }]
        )
        compressed.append({
            **chunk,
            "text": response.content[0].text
        })
    return compressed
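Use Haiku for compression: it is fast and cheap, and the task is simple extraction, not complex reasoning. Compression adds one API call per chunk, so it pays off mainly when chunks are long and the main generation model costs noticeably more per token than Haiku. Compare total token spend with and without it before enabling it everywhere.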
Multi-Query Retrieval
A single user question might require information from multiple angles. Multi-query retrieval generates several search queries from the original question and combines the results. This catches relevant documents that a single query might miss.
def multi_query_retrieve(db, question, n_queries=3):
    client = anthropic.Anthropic()
    # Generate alternative queries
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                f"Generate {n_queries} alternative search queries "
                "for this question. Return one per line, no numbering.\n"
                f"Question: {question}"
            )
        }]
    )
    # Drop blank lines so empty strings are never sent to the search
    alternatives = [q.strip() for q in response.content[0].text.splitlines() if q.strip()]
    queries = [question] + alternatives
    # Search with each query and deduplicate
    all_results = {}
    for q in queries:
        for r in hybrid_search(db, q, top_k=3):
            if r["key"] not in all_results:
                all_results[r["key"]] = r
    return list(all_results.values())[:5]
Metadata Filtering
Not all documents are equally relevant at all times. Metadata filtering narrows the search space before vector search runs, improving both speed and relevance. Filter by document type, date range, access level, or any other metadata dimension.
def filtered_search(db, query, collection=None, min_date=None, top_k=5):
    query_vec = embed(query)
    conditions = ["v.distance < 0.75"]
    params = [json.dumps(query_vec), top_k]
    # The metadata conditions live on the joined table, so they filter the
    # k nearest neighbors; fewer than top_k rows may come back
    if collection:
        conditions.append("m.collection = ?")
        params.append(collection)
    if min_date:
        conditions.append("m.created_at > ?")
        params.append(min_date)
    where = " AND ".join(conditions)
    # Only fixed condition strings are interpolated; values stay parameterized
    results = db.execute(f"""
        SELECT m.key, m.text, v.distance
        FROM vec_search v
        JOIN vec_meta m ON m.rowid = v.rowid
        WHERE v.embedding MATCH ? AND k = ? AND {where}
        ORDER BY v.distance
    """, params).fetchall()
    return [{"key": r[0], "text": r[1], "score": 1 - r[2]} for r in results]
Evaluation
A RAG system is only as good as its retrieval quality. Measure it systematically:
- Retrieval recall: What percentage of relevant documents are retrieved? Build a test set of question-answer pairs with known source documents and measure whether retrieval finds them.
- Answer accuracy: Do Claude's answers match ground truth? Use automated evaluation with a stronger model (Opus) as judge.
- Latency: Track end-to-end query time (embed + search + generate). Set budgets: interactive queries under 3 seconds, batch queries under 10 seconds.
- Token efficiency: How many input tokens does each query consume? Monitor for context bloat — if retrieved chunks are too large or too numerous, you waste tokens on irrelevant content.
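Retrieval recall is the easiest of these to automate. A minimal sketch, assuming a test set of (question, relevant keys) pairs and a search function that returns a ranked list of keys; the function names and the sample data are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant keys that appear in the top-k retrieved keys."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(set(relevant))

def mean_recall(test_set, search_fn, k=5):
    """Average recall@k over (question, relevant_keys) pairs."""
    scores = [recall_at_k(search_fn(q), rel, k) for q, rel in test_set]
    return sum(scores) / len(scores) if scores else 0.0

# Stand-in for a real search function: canned ranked results per question
canned = {"q1": ["a", "b"], "q2": ["x", "z"]}
test_set = [("q1", ["a"]), ("q2", ["x", "y"])]

single = recall_at_k(["a", "b", "c"], ["a", "d"], k=2)
overall = mean_recall(test_set, lambda q: canned[q], k=5)
print(single, overall)
```

With a real system, `search_fn` would wrap the hybrid search and return the `key` of each result; tracking this number across chunking or threshold changes shows whether a change helped.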
We log every RAG query — the question, retrieved chunks, Claude's answer, and response time. Weekly reviews of these logs surface patterns: queries that consistently retrieve poor results reveal gaps in your knowledge base; queries with high token counts reveal chunking issues; slow queries reveal embedding or search bottlenecks.
Common Pitfalls
- Too many chunks. Retrieving 20 chunks overwhelms Claude with context and increases cost. Start with 3-5 chunks and increase only if answer quality is insufficient.
- No distance threshold. Without a maximum distance, every query returns results — even when nothing relevant exists. Claude then generates confidently wrong answers from irrelevant context. Always set a threshold.
- Stale embeddings. If your documents change but your embeddings do not, retrieval returns outdated information. Build an update pipeline that re-embeds modified documents automatically.
- Ignoring BM25. Pure vector search misses exact keyword matches. A user searching for "error code E4012" needs keyword matching, not semantic similarity. Hybrid search catches both.
- Chunking without overlap. Information that spans a chunk boundary is lost. Always include 10-15% overlap between consecutive chunks.
- No source attribution. If users cannot verify where an answer came from, they cannot trust the system. Always return source references alongside answers.
Scaling RAG
Our SQLite-based system handles tens of thousands of vectors on a single machine. When you outgrow this:
- 10K-100K vectors: SQLite-vec with exact search. Sub-50ms queries. No infrastructure needed.
- 100K-1M vectors: pgvector or Qdrant with approximate nearest neighbor (ANN) indexing. Queries stay fast with HNSW or IVF indexes.
- 1M+ vectors: Managed services (Pinecone, Weaviate Cloud). Distributed search, automatic scaling, multi-region replication.
Do not over-architect for scale you do not have. Start with SQLite. Move to a managed database when (not before) you hit performance limits. Premature infrastructure is the most common waste in RAG projects — teams spend weeks setting up Kubernetes deployments for vector databases that hold 5,000 documents.
From RAG to Sovereign Memory
RAG gives Claude access to your documents. Persistent memory gives Claude the ability to learn and remember across sessions. The progression is natural: start with RAG for document retrieval, add memory for conversation history, add graph relationships for entity connections, and eventually build a system that gets smarter with every interaction.
This is the path we followed at Like One. Our brain started as a simple RAG system — embed documents, search, generate. It evolved into a sovereign memory architecture with 11 collections, hybrid search, graph relationships, and automatic embedding. Each layer built on the previous one. RAG is the foundation.
For the API fundamentals that RAG builds on, read our Claude API guide. For building agents that use RAG as a tool, see our Agent SDK tutorial. For connecting RAG to external data sources via MCP, start with our MCP server tutorial. And if you want to certify your Claude architecture skills including RAG, check our CCA exam prep guide.