What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that combines information retrieval with AI text generation. Instead of relying only on training data, RAG first searches a knowledge base for relevant documents, then includes those documents as context when generating an answer. This produces responses grounded in your specific data rather than the model's general knowledge.

How much does it cost to build a RAG system with Claude?

The infrastructure cost can be zero if you use local embeddings (Ollama) and SQLite for vector storage. Claude API costs depend on usage: a typical RAG query with 5 retrieved chunks costs roughly $0.01-0.05 with Sonnet 4.6. For high-volume applications, prompt caching lowers the cost of a repeated system prompt by up to 90%, because cache reads are billed at a fraction of the normal input price.

What is the best embedding model for RAG with Claude?

For local deployment, mxbai-embed-large (1024 dimensions) provides excellent quality at 37ms per query. For API-based embeddings, Voyage AI is the provider that Anthropic's documentation recommends, since Anthropic does not offer an embedding model of its own. OpenAI's text-embedding-3-small is another reliable option. The quality differences between models are smaller than the impact of good chunking and retrieval design.

How many chunks should I retrieve in a RAG query?

Start with 3-5 chunks per query. More chunks provide more context but increase cost and can overwhelm the model with irrelevant information. The optimal number depends on your chunk size and document complexity. Monitor answer quality and adjust — if Claude frequently says it lacks information, increase the count. If answers include irrelevant details, decrease it.

What is hybrid search in RAG?

Hybrid search combines vector search (semantic similarity) with keyword search (BM25/exact matching) using Reciprocal Rank Fusion. Vector search finds semantically related content while keyword search catches exact terms. Together they provide better retrieval than either method alone, handling both conceptual queries and specific keyword lookups.

SQLite or Pinecone for RAG vector storage?

SQLite with sqlite-vec handles up to 100,000 vectors with sub-50ms search on a single machine. Use it for local-first applications, prototypes, and moderate-scale production systems. Pinecone is better for millions of vectors, multi-tenant applications, and teams that want managed infrastructure. Start with SQLite and migrate when you hit performance limits.

How do I prevent hallucination in RAG?

Three defenses: set a maximum distance threshold on vector search so irrelevant results are excluded, instruct Claude in the system prompt to answer only from provided context, and always include source attribution so users can verify answers. In our system, a maximum cosine distance of 0.75 eliminates most noise while keeping relevant results; treat it as a starting point and tune it for your embedding model.

Can I use RAG with Claude Code?

Yes. Claude Code uses CLAUDE.md files and project context as a form of RAG — it retrieves relevant project files and instructions before generating code. For more sophisticated RAG, build an MCP server that exposes your knowledge base as a tool, and Claude Code can query it during development sessions.

What is the difference between RAG and fine-tuning?

RAG retrieves external information at query time without changing the model. Fine-tuning permanently adjusts model weights on your data. RAG is better for knowledge that changes frequently, requires source attribution, or involves private data you do not want in model weights. Fine-tuning is better for teaching the model a specific style, format, or specialized capability. Fine-tuning is not generally available for Claude through the Anthropic API, which makes RAG the primary approach for customization.

How do I evaluate RAG system quality?

Measure retrieval recall (does the system find the right documents), answer accuracy (does Claude's answer match ground truth), latency (total query time under 3 seconds for interactive use), and token efficiency (are retrieved chunks relevant or wasteful). Build a test set of question-answer pairs with known sources and run automated evaluation weekly.

How to Build a RAG System with Claude

Build retrieval-augmented generation with Claude. Chunking, embeddings, vector search, and production patterns with working code.

Retrieval-augmented generation (RAG) is how you give Claude access to knowledge it was not trained on — your documents, your data, your institutional memory. Instead of cramming everything into the context window and hoping for the best, RAG retrieves the most relevant pieces of information for each query and feeds them to Claude alongside the question. The result is answers grounded in your actual data, not Claude's training set.

We run a production RAG system at Like One that powers our sovereign brain — over 7,000 vectors across 11 collections, serving hybrid search queries in under 50 milliseconds. This is not a demo. This is the architecture we built, broke, rebuilt, and now rely on for every session, every search, every decision. The patterns in this guide are extracted from that system.

If you are new to the Claude API, read our API guide first. If you understand API calls and want to give Claude memory, keep reading.

What RAG Actually Does

RAG solves a fundamental limitation: Claude knows what it was trained on, but it does not know your data. Your company's internal documents, your customer records, your product specifications, your legal contracts — none of this exists in Claude's training data. Without RAG, you have two bad options: paste everything into the prompt (expensive, limited by context window) or accept that Claude will hallucinate when asked about your specific information.

RAG introduces a retrieval step before generation. When a user asks a question, you first search your knowledge base for relevant documents, then include those documents in the prompt alongside the question. Claude generates its answer using both its training knowledge and the retrieved context. The architecture looks like this:

Index: Split your documents into chunks, generate embeddings for each chunk, and store them in a vector database.
Retrieve: When a query arrives, embed the query, search the vector database for the most similar chunks, and return the top results.
Generate: Send the query plus the retrieved chunks to Claude as context, and let Claude synthesize an answer from the relevant information.

Each step has tradeoffs that affect answer quality, latency, and cost. The rest of this guide walks through each step with the decisions and code that make them work in production.

Step 1: Chunking Your Documents

Chunking is where most RAG systems succeed or fail. The goal is to split documents into pieces that are small enough to be specific but large enough to be self-contained. Bad chunking produces bad retrieval, and no amount of model capability compensates for retrieving the wrong information.

Chunking Strategies

Fixed-size chunking splits documents into chunks of a fixed token or character count with overlap between consecutive chunks. Simple to implement, works surprisingly well for uniform documents like API documentation or knowledge base articles.

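def fixed_size_chunks(text, chunk_size=500, overlap=50):
    # Sizes are counted in whitespace-separated words, a rough proxy for tokens
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(' '.join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # This chunk reached the end; avoid a redundant tail chunk
    return chunks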

Semantic chunking splits at natural boundaries — paragraphs, sections, headings. This preserves the document's logical structure and produces chunks that are more self-contained. More complex to implement but produces better retrieval for structured documents.

import re

def semantic_chunks(text, max_tokens=500):
    # Split on double newlines (paragraphs) or headings
    sections = re.split(r'\n\n|(?=^#{1,3}\s)', text, flags=re.MULTILINE)
    
    chunks = []
    current = []
    current_len = 0
    
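    for section in sections:
        if not section.strip():
            continue  # re.split can yield empty pieces between separators
        section_len = len(section.split())  # Word count as a rough token proxy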
        if current_len + section_len > max_tokens and current:
            chunks.append('\n\n'.join(current))
            current = [section]
            current_len = section_len
        else:
            current.append(section)
            current_len += section_len
    
    if current:
        chunks.append('\n\n'.join(current))
    
    return chunks

Recursive chunking tries multiple split strategies in order — first by section, then by paragraph, then by sentence, then by word — until each chunk fits within the size limit. This is the approach used by LangChain's RecursiveCharacterTextSplitter and works well as a general-purpose default.
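The recursive strategy has no listing above, so here is a minimal sketch of the idea. It is not LangChain's implementation: the function name, the separator list, and the word-based size limit are all assumptions chosen for illustration.

```python
def recursive_chunks(text, max_words=500, separators=("\n\n", "\n", ". ", " ")):
    """Split with the coarsest separator first, recursing only into oversized pieces."""
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard split on words
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate.split()) <= max_words:
            current = candidate  # Keep merging small neighbours into one chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(piece.split()) > max_words:
            # This piece alone is too large: retry it with finer separators
            chunks.extend(recursive_chunks(piece, max_words, finer))
            current = ""
        else:
            current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Small pieces are merged back together up to the limit, so a document made of short paragraphs still produces chunks close to the target size. Note that the separator at a chunk boundary is dropped (for example the period when splitting on ". "), which a production splitter would likely preserve.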

How We Chunk

In our production system, we use semantic chunking with a 500-token target and metadata enrichment. Each chunk carries its source document ID, position index, and a condensed summary of the parent document's title and category. This metadata is critical for reranking and attribution — when Claude cites a source, we can trace it back to the exact document and section.

Key rules we learned the hard way:

Overlap matters. Without overlap, you lose context that spans chunk boundaries. A 50-token overlap (10% of chunk size) catches most boundary-spanning information without excessive redundancy.
Chunk size depends on your content. Technical documentation works well at 300-500 tokens. Narrative content (meeting notes, emails) needs 500-800 tokens to preserve context. Legal documents need 800-1200 tokens because clauses reference each other across paragraphs.
Never split code blocks. If a chunk contains a code example, keep the entire block together even if it exceeds your target size. A partial code snippet is worse than a slightly oversized chunk.
Preserve headings. Prepend the section heading to every chunk from that section. A chunk that says "Use the --force flag" is useless without knowing it is from the "Git Reset" section.
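The heading and metadata rules above can be sketched as follows. This is an illustrative version rather than our production chunker: it assumes markdown-style input with `#` headings, and the function name and the dictionary fields (`doc_id`, `position`, `heading`) are hypothetical.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker

def chunks_with_headings(markdown_text, doc_id):
    """Split on markdown headings and prepend each heading to its chunk."""
    chunks = []
    heading = ""
    body = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            prefix = f"{heading}\n\n" if heading else ""
            chunks.append({
                "doc_id": doc_id,          # Source document, for attribution
                "position": len(chunks),   # Order of the chunk within the document
                "heading": heading,
                "text": prefix + text,     # Heading travels with the chunk text
            })

    in_code = False
    for line in markdown_text.splitlines():
        if line.startswith(FENCE):
            in_code = not in_code  # A '#' inside a code block is not a heading
        if not in_code and re.match(r"#{1,3}\s", line):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Because a section is only closed when the next heading appears outside a code fence, code blocks are never split, which matches the "never split code blocks" rule at the cost of occasionally oversized chunks.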

Step 2: Generating Embeddings

Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, which is what makes vector search work — you find relevant chunks by finding vectors that are close to the query vector in high-dimensional space.
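To make "close in high-dimensional space" concrete, here is cosine similarity computed by hand. The three-dimensional vectors are made-up stand-ins for real embeddings, which have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors with invented values, standing in for embedded chunks
refund_policy = [0.9, 0.1, 0.0]
return_rules = [0.8, 0.2, 0.1]
api_auth = [0.0, 0.2, 0.9]

print(cosine_similarity(refund_policy, return_rules))  # Related topics: close to 1
print(cosine_similarity(refund_policy, api_auth))      # Unrelated topics: close to 0
```

Cosine distance, the measure used for thresholds later in this guide, is simply one minus this similarity.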

Choosing an Embedding Model

You have several options for embedding models. The choice depends on your requirements for quality, speed, cost, and privacy:

Voyage AI embeddings: High-quality embeddings from the provider that Anthropic's documentation recommends (Anthropic does not offer its own embedding model). Available via API. Good default if you are already using the Anthropic ecosystem.
OpenAI's text-embedding-3-small/large: Widely used, well-documented, API-based. Good quality at reasonable cost.
Local models (mxbai-embed-large, nomic-embed-text): Run on your own hardware via Ollama or similar. Zero API cost, full privacy, sub-50ms latency. We use mxbai-embed-large (1024 dimensions) and get 37ms per query on an M3 Mac.

# Using a local model via Ollama
import requests

def embed(text, model="mxbai-embed-large"):
    response = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": model, "input": text}
    )
    return response.json()["embeddings"][0]

# Embed a chunk
vector = embed("RAG systems retrieve relevant documents before generating.")
print(f"Dimensions: {len(vector)}")  # 1024 for mxbai-embed-large

# Using Voyage via API
import voyageai

client = voyageai.Client()
result = client.embed(
    ["RAG systems retrieve relevant documents."],
    model="voyage-3",
    input_type="document"
)
vector = result.embeddings[0]

For most production systems, we recommend starting with a local model. The quality difference between local and API models is small for RAG retrieval, and the latency and cost advantages of local inference are significant at scale. You can always upgrade to an API model later if retrieval quality is insufficient.

Embedding Pipeline

An embedding pipeline processes new and updated documents automatically. In our system, a cron job runs every 5 minutes, checks for new brain entries, generates embeddings, and stores them in SQLite with the sqlite-vec extension. The entire pipeline processes a new entry in under 100ms.

import sqlite3
import json

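def embed_and_store(db_path, entries):
    db = sqlite3.connect(db_path)
    # The vec0 virtual table is only usable once sqlite-vec is loaded
    db.enable_load_extension(True)
    db.load_extension("vec0")
    
    for entry in entries:
        # Generate embedding
        vector = embed(entry["text"])
        
        # Remove earlier rows for this id so re-runs update instead of duplicating
        db.execute("DELETE FROM vec_search WHERE rowid = ?", (entry["id"],))
        db.execute("DELETE FROM vec_meta WHERE rowid = ?", (entry["id"],))
        
        # Store in sqlite-vec
        db.execute(
            "INSERT INTO vec_search(rowid, embedding) VALUES (?, ?)",
            (entry["id"], json.dumps(vector))
        )
        
        # Store metadata
        db.execute(
            "INSERT INTO vec_meta(rowid, key, collection, text) VALUES (?, ?, ?, ?)",
            (entry["id"], entry["key"], entry["collection"], entry["text"])
        )
    
    db.commit()
    db.close()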

The pipeline should be idempotent — running it twice on the same document produces the same result. Check for existing embeddings before generating new ones, and update rather than duplicate when documents change.
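One way to check for existing embeddings is to store a content hash per document and skip the embedding call when the hash has not changed. The sketch below assumes a bookkeeping table named `embed_state`, which is hypothetical and not part of sqlite-vec or of the schema shown in this guide.

```python
import hashlib
import sqlite3

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_unchanged(db, key, text):
    """True if this key was already embedded with exactly this text."""
    row = db.execute(
        "SELECT sha256 FROM embed_state WHERE key = ?", (key,)
    ).fetchone()
    return row is not None and row[0] == content_hash(text)

def mark_embedded(db, key, text):
    # Call this only after the embedding has been stored successfully
    db.execute(
        "INSERT OR REPLACE INTO embed_state (key, sha256) VALUES (?, ?)",
        (key, content_hash(text)),
    )

db = sqlite3.connect(":memory:")
# 'embed_state' is a hypothetical bookkeeping table for this example
db.execute("CREATE TABLE embed_state (key TEXT PRIMARY KEY, sha256 TEXT)")
```

In a pipeline loop, the pattern would be: skip the entry if `is_unchanged(...)` returns True, otherwise embed, store, and then call `mark_embedded(...)`. Recording the hash after the write means a crash mid-run leads to a retry rather than a silently missing embedding.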

Step 3: Vector Storage

You need somewhere to store embeddings and search them efficiently. The options range from simple files to dedicated vector databases:

SQLite + sqlite-vec: Our choice. Zero infrastructure, embedded in your application, handles tens of thousands of vectors with sub-50ms search. Perfect for single-machine deployments and local-first applications.
ChromaDB: Python-native vector database. Easy setup, good for prototyping. We used it early on but migrated to sqlite-vec for reliability and simplicity.
Pinecone / Weaviate / Qdrant: Managed vector databases for scale. Use these when you have millions of vectors, need multi-tenant isolation, or require distributed search across regions.
pgvector (PostgreSQL): If you already run PostgreSQL, pgvector adds vector search without a separate database. Good for applications that want to keep everything in one data store.

SQLite-vec Setup

import sqlite3

def setup_vec_db(db_path):
    db = sqlite3.connect(db_path)
    db.enable_load_extension(True)
    db.load_extension("vec0")  # Load sqlite-vec
    
    # Create virtual table for vector search
    db.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS vec_search
        USING vec0(
            embedding float[1024] distance_metric=cosine  -- Match your model's dimensions; cosine so the 0.75 threshold applies (vec0 defaults to L2)
        )
    """)
    
    # Metadata table for chunk content and source tracking
    db.execute("""
        CREATE TABLE IF NOT EXISTS vec_meta (
            rowid INTEGER PRIMARY KEY,
            key TEXT,
            collection TEXT,
            text TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    
    db.commit()
    return db

SQLite-vec uses exact nearest-neighbor search by default, which is fine for up to ~100,000 vectors. Beyond that, consider approximate search (ANN) via Pinecone or Qdrant, which trade a small amount of accuracy for massive speed gains at scale.

Step 4: Retrieval

Retrieval is the core of RAG. Given a user query, find the most relevant chunks from your knowledge base. The quality of retrieval directly determines the quality of Claude's answers — retrieve the right context and Claude shines; retrieve the wrong context and Claude confidently generates incorrect answers from irrelevant information.

Vector Search (Semantic)

def vector_search(db, query, top_k=5, max_distance=0.75):
    query_vec = embed(query)
    
    results = db.execute("""
        SELECT m.key, m.text, v.distance
        FROM vec_search v
        JOIN vec_meta m ON m.rowid = v.rowid
        WHERE v.embedding MATCH ?
        AND k = ?
        AND v.distance < ?
        ORDER BY v.distance
    """, (json.dumps(query_vec), top_k, max_distance)).fetchall()
    
    return [{"key": r[0], "text": r[1], "score": 1 - r[2]} for r in results]

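The max_distance threshold is critical. Without it, vector search always returns results even when nothing is relevant: the query "what is the weather" will return your closest chunk about API configuration, because something is always closest. We use 0.75 as our cosine distance threshold, which eliminates noise while keeping genuinely relevant results. The number only means cosine distance if the vec0 table is declared with a cosine distance metric, since sqlite-vec defaults to Euclidean (L2) distance, and the right cutoff varies by embedding model.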
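The effect of the threshold is easy to see with an exact (brute-force) search written out in plain Python. This is a sketch, not what sqlite-vec runs internally: the function name is hypothetical, the two-dimensional vectors are invented, and it assumes all vectors are unit-length so that cosine distance reduces to one minus the dot product.

```python
def knn_with_threshold(query_vec, items, top_k=5, max_distance=0.75):
    """Exact search over (key, unit_vector) pairs using cosine distance."""
    scored = []
    for key, vec in items:
        # Assumes unit-length vectors: cosine distance = 1 - dot product
        distance = 1.0 - sum(q * v for q, v in zip(query_vec, vec))
        scored.append((distance, key))
    scored.sort()
    return [(key, dist) for dist, key in scored[:top_k] if dist < max_distance]

# Toy unit vectors with invented values
items = [("api-config", [1.0, 0.0]), ("billing", [0.0, 1.0])]

on_topic = knn_with_threshold([0.6, 0.8], items)
off_topic = knn_with_threshold([-1.0, 0.0], items)
no_cutoff = knn_with_threshold([-1.0, 0.0], items, top_k=1, max_distance=float("inf"))
```

Here `off_topic` comes back empty because nothing is within the cutoff, while `no_cutoff` still returns the "closest" chunk even though it is unrelated to the query.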

Keyword Search (BM25)

Vector search finds semantically similar content, but sometimes users search for exact terms. BM25 (the algorithm behind traditional search engines) complements vector search by matching exact keywords. SQLite's FTS5 extension provides BM25 scoring out of the box.

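def keyword_search(db, query, top_k=5):
    # Quote each term so punctuation such as "?" is not parsed as FTS5 syntax;
    # OR lets a natural-language question match without requiring every word
    terms = ['"' + t.replace('"', '""') + '"' for t in query.split()]
    if not terms:
        return []
    results = db.execute("""
        SELECT key, text, bm25(content_fts) as score
        FROM content_fts
        WHERE content_fts MATCH ?
        ORDER BY score
        LIMIT ?
    """, (" OR ".join(terms), top_k)).fetchall()
    
    # bm25() returns lower (more negative) values for better matches, so negate it
    return [{"key": r[0], "text": r[1], "score": -r[2]} for r in results]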
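The keyword search above queries a `content_fts` table that this guide does not define. A minimal sketch of creating and populating such a table follows; the schema (a `key` column stored but not indexed, plus a `text` column) and the helper names are assumptions inferred from the query, and it requires a Python build whose SQLite includes FTS5.

```python
import sqlite3

def setup_fts(db):
    """Create the FTS5 table that keyword_search reads (assumed schema)."""
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS content_fts "
        "USING fts5(key UNINDEXED, text)"
    )

def index_chunk(db, key, text):
    # Delete first so re-indexing a key replaces it instead of duplicating it
    db.execute("DELETE FROM content_fts WHERE key = ?", (key,))
    db.execute("INSERT INTO content_fts(key, text) VALUES (?, ?)", (key, text))

db = sqlite3.connect(":memory:")
setup_fts(db)
index_chunk(db, "errors", "Error code E4012 means the session token expired.")
index_chunk(db, "setup", "Install the package and configure the API key.")

# Exact-term lookup: only the chunk containing E4012 matches
rows = db.execute(
    "SELECT key, bm25(content_fts) FROM content_fts "
    "WHERE content_fts MATCH ? ORDER BY bm25(content_fts)",
    ('"E4012"',),
).fetchall()
```

FTS5's `bm25()` returns smaller values for better matches, which is why results are ordered ascending and why the search function negates the score before returning it.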

Hybrid Search (Best of Both)

The best RAG systems combine vector and keyword search. Hybrid search uses Reciprocal Rank Fusion (RRF) to merge results from both approaches, giving you semantic understanding and exact keyword matching in a single query.

def hybrid_search(db, query, top_k=5, vec_weight=0.7, bm25_weight=0.3):
    vec_results = vector_search(db, query, top_k=top_k * 2)
    bm25_results = keyword_search(db, query, top_k=top_k * 2)
    
    # Reciprocal Rank Fusion
    scores = {}
    for rank, r in enumerate(vec_results):
        key = r["key"]
        scores[key] = scores.get(key, 0) + vec_weight / (rank + 60)
    
    for rank, r in enumerate(bm25_results):
        key = r["key"]
        scores[key] = scores.get(key, 0) + bm25_weight / (rank + 60)
    
    # Sort by combined score and return top_k
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    
    # Fetch full text for top results
    results = []
    for key, score in ranked:
        text = next((r["text"] for r in vec_results + bm25_results if r["key"] == key), "")
        results.append({"key": key, "text": text, "score": score})
    
    return results

Our production system uses hybrid search with a 0.7/0.3 vector/BM25 split. Vector search handles the semantic heavy lifting while BM25 catches exact-match queries that vector search sometimes misses. Average query time: 45ms including embedding generation.
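A small worked example shows how the weighted fusion behaves. The function below restates the scoring loop from hybrid_search over plain lists of keys; the function name and the document keys are made up for illustration.

```python
def rrf_merge(rankings, weights, k=60, top_k=5):
    """Weighted Reciprocal Rank Fusion over lists of keys (best first)."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, key in enumerate(ranking):
            # Same formula as hybrid_search: weight / (rank + 60), rank from 0
            scores[key] = scores.get(key, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

vec_ranking = ["doc_a", "doc_b", "doc_c"]   # Hypothetical vector search order
bm25_ranking = ["doc_c", "doc_a", "doc_d"]  # Hypothetical keyword search order

merged = rrf_merge([vec_ranking, bm25_ranking], weights=[0.7, 0.3])
print(merged)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_c ranks last in vector search but first in keyword search, and that second vote lifts it above doc_b, which appears in only one list. Because RRF uses ranks rather than raw scores, the cosine and BM25 scales never need to be normalized against each other.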

Reranking

For high-stakes queries where retrieval quality matters more than speed, add a reranking step. After initial retrieval returns the top candidates, a reranking model rescores them for relevance to the specific query. This catches cases where the initial retrieval surfaces the right documents but in the wrong order.

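import re

def rerank(query, results, model="llama3.2:3b"):
    """Use a small LLM to rerank results by relevance."""
    docs = "\n".join(f'{i+1}. {r["text"][:200]}' for i, r in enumerate(results))
    prompt = (
        "Rate the relevance of each document to the query.\n"
        f"Query: {query}\n\nDocuments:\n{docs}\n\n"
        "Return only a JSON array of document numbers ordered by relevance."
    )
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False}
    )
    response.raise_for_status()
    # Ollama returns the generated text in the "response" field
    numbers = [int(n) - 1 for n in re.findall(r"\d+", response.json()["response"])]
    # Keep valid, unique positions in the model's order, then append any it omitted
    order = []
    for i in numbers:
        if 0 <= i < len(results) and i not in order:
            order.append(i)
    order += [i for i in range(len(results)) if i not in order]
    return [results[i] for i in order]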

Reranking adds ~2-3 seconds of latency, so use it selectively. We enable reranking for pre-session context generation (where quality matters and latency is acceptable) and disable it for interactive search (where speed matters more).

Step 5: Generation with Retrieved Context

The final step is sending the retrieved context to Claude alongside the user's question. This is where RAG meets the Claude API.

import anthropic

def rag_query(question, knowledge_base_db):
    # Retrieve relevant chunks
    context = hybrid_search(knowledge_base_db, question, top_k=5)
    
    # Format context for Claude
    context_text = "\n\n".join([
        f"[Source: {c['key']}]\n{c['text']}" 
        for c in context
    ])
    
    # Generate answer with Claude
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="""You are a helpful assistant that answers questions based on 
        the provided context. If the context does not contain enough information 
        to answer the question, say so explicitly. Never make up information 
        that is not in the context.""",
        messages=[{
            "role": "user",
            "content": f"""Context:\n{context_text}\n\nQuestion: {question}"""
        }]
    )
    
    return {
        "answer": response.content[0].text,
        "sources": [c["key"] for c in context],
        "tokens": response.usage
    }

System Prompt Design for RAG

The system prompt is critical for RAG quality. It tells Claude how to use the retrieved context and what to do when the context is insufficient. Key elements:

Grounding instruction: Tell Claude to answer from the provided context, not its training data. This prevents hallucination about your specific data.
Citation instruction: Tell Claude to reference source documents. This lets users verify answers and builds trust in the system.
Insufficiency instruction: Tell Claude what to do when the context does not contain the answer. "I don't have enough information" is better than a fabricated answer.
Format instruction: Specify the output format — bullet points, paragraphs, JSON — to get consistent, parseable responses.

Our production system prompt for RAG queries:

system = """Answer questions using ONLY the provided context documents.

Rules:
- Base your answer entirely on the context provided
- If the context does not contain enough information, say "I don't have 
  enough information in my knowledge base to answer this fully"
- Cite sources using [Source: key] format
- Be concise — answer the question directly, then provide supporting detail
- Never invent statistics, dates, or facts not present in the context"""
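Since the prompt asks for citations in [Source: key] format, the application can check that every cited key was actually among the retrieved chunks. The sketch below is one possible check, not part of our production prompt handling; the function name and the example keys are hypothetical.

```python
import re

def check_citations(answer, retrieved_keys):
    """Compare [Source: key] citations in an answer with the retrieved keys."""
    cited = {c.strip() for c in re.findall(r"\[Source:\s*([^\]]+)\]", answer)}
    allowed = set(retrieved_keys)
    return {
        "cited": sorted(cited & allowed),    # Citations backed by retrieved chunks
        "unknown": sorted(cited - allowed),  # Citations that match no retrieved chunk
        "has_citation": bool(cited),
    }

answer = (
    "Refunds are processed within 5 days [Source: refund-policy]. "
    "Older plans differ [Source: pricing-2023]."
)
report = check_citations(answer, ["refund-policy", "shipping"])
```

An answer with entries under "unknown", or with no citation at all, is a reasonable candidate for flagging or retrying, since it suggests the model drifted from the provided context.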

Production Patterns

Contextual Compression

When retrieved chunks are long, compress them before sending to Claude. Extract only the sentences relevant to the query. This reduces token usage (cost) and focuses Claude's attention on the most relevant information.

def compress_context(query, chunks, model="claude-haiku-4-5"):
    """Use Haiku to extract relevant sentences from each chunk."""
    client = anthropic.Anthropic()
    compressed = []
    
    for chunk in chunks:
        response = client.messages.create(
            model=model,
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"""Extract only the sentences relevant to this 
                query from the document. Return just the relevant text.
                
                Query: {query}
                Document: {chunk['text']}"""
            }]
        )
        compressed.append({
            **chunk,
            "text": response.content[0].text
        })
    
    return compressed

Use Haiku for compression — it is fast and cheap, and the task is simple extraction, not complex reasoning. The cost of compression is almost always offset by the token savings in the main generation call.

Multi-Query Retrieval

A single user question might require information from multiple angles. Multi-query retrieval generates several search queries from the original question and combines the results. This catches relevant documents that a single query might miss.

def multi_query_retrieve(db, question, n_queries=3):
    client = anthropic.Anthropic()
    
    # Generate alternative queries
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Generate {n_queries} alternative search queries 
            for this question. Return one per line, no numbering.
            
            Question: {question}"""
        }]
    )
    
    # Drop blank lines so empty strings are never sent to the search step
    queries = [question] + [q.strip() for q in response.content[0].text.splitlines() if q.strip()]
    
    # Search with each query and deduplicate
    all_results = {}
    for q in queries:
        for r in hybrid_search(db, q, top_k=3):
            if r["key"] not in all_results:
                all_results[r["key"]] = r
    
    return list(all_results.values())[:5]

Metadata Filtering

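Not all documents are equally relevant at all times. Metadata filtering restricts results by document type, date range, access level, or any other metadata dimension, which improves relevance. In the query shown here the filters are applied to the nearest-neighbor candidates, so a restrictive filter can return fewer than top_k results; raise k, or use a store that filters before the vector search, when that matters.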

def filtered_search(db, query, collection=None, min_date=None, top_k=5):
    query_vec = embed(query)
    
    conditions = ["v.distance < 0.75"]
    params = [json.dumps(query_vec), top_k]
    
    if collection:
        conditions.append("m.collection = ?")
        params.append(collection)
    if min_date:
        conditions.append("m.created_at > ?")
        params.append(min_date)
    
    where = " AND ".join(conditions)
    
    results = db.execute(f"""
        SELECT m.key, m.text, v.distance
        FROM vec_search v
        JOIN vec_meta m ON m.rowid = v.rowid
        WHERE v.embedding MATCH ? AND k = ? AND {where}
        ORDER BY v.distance
    """, params).fetchall()
    
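    # Return the same dict shape as vector_search so callers can treat both alike
    return [{"key": r[0], "text": r[1], "score": 1 - r[2]} for r in results]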

Evaluation

A RAG system is only as good as its retrieval quality. Measure it systematically:

Retrieval recall: What percentage of relevant documents are retrieved? Build a test set of question-answer pairs with known source documents and measure whether retrieval finds them.
Answer accuracy: Do Claude's answers match ground truth? Use automated evaluation with a stronger model (Opus) as judge.
Latency: Track end-to-end query time (embed + search + generate). Set budgets: interactive queries under 3 seconds, batch queries under 10 seconds.
Token efficiency: How many input tokens does each query consume? Monitor for context bloat — if retrieved chunks are too large or too numerous, you waste tokens on irrelevant content.
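Retrieval recall from the list above can be computed with a few lines once a test set exists. In this sketch the test-case format (a question plus its known source keys) and the `retrieve` callable, which should return a ranked list of chunk keys, are assumptions for illustration.

```python
def recall_at_k(test_cases, retrieve, k=5):
    """Fraction of known-relevant sources that appear in the top k results."""
    hits = 0
    total = 0
    for case in test_cases:
        retrieved = set(retrieve(case["question"])[:k])
        expected = set(case["sources"])
        hits += len(expected & retrieved)
        total += len(expected)
    return hits / total if total else 0.0

# Hypothetical test set and a stand-in retriever backed by a fixed lookup
cases = [
    {"question": "q1", "sources": ["a"]},
    {"question": "q2", "sources": ["b", "c"]},
]
fake_results = {"q1": ["a", "x"], "q2": ["c", "y", "b"]}

score = recall_at_k(cases, lambda q: fake_results[q], k=2)
```

With k=2 the retriever finds "a" and "c" but misses "b", so two of the three expected sources are recovered. Running the same test set at several values of k shows whether relevant chunks are missing entirely or merely ranked too low.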

We log every RAG query — the question, retrieved chunks, Claude's answer, and response time. Weekly reviews of these logs surface patterns: queries that consistently retrieve poor results reveal gaps in your knowledge base; queries with high token counts reveal chunking issues; slow queries reveal embedding or search bottlenecks.
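A query log of this kind can be as simple as one JSON object per line. The following is a minimal sketch rather than our actual logging code: the field names, the `timed` helper, and the file layout are all assumptions.

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Record how long a pipeline stage takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = round((time.perf_counter() - start) * 1000, 2)

def log_query(path, question, chunk_keys, answer, timings):
    record = {
        "ts": time.time(),
        "question": question,
        "chunks": chunk_keys,   # Keys of the retrieved chunks, for later review
        "answer": answer,
        "timings_ms": timings,  # Per-stage latency, e.g. embed / search / generate
    }
    # Append-only JSON Lines file: one query per line
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Wrapping each stage in `with timed("search", timings): ...` gives the per-stage breakdown needed to tell whether a slow query is an embedding, search, or generation problem.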

Common Pitfalls

Too many chunks. Retrieving 20 chunks overwhelms Claude with context and increases cost. Start with 3-5 chunks and increase only if answer quality is insufficient.
No distance threshold. Without a maximum distance, every query returns results — even when nothing relevant exists. Claude then generates confidently wrong answers from irrelevant context. Always set a threshold.
Stale embeddings. If your documents change but your embeddings do not, retrieval returns outdated information. Build an update pipeline that re-embeds modified documents automatically.
Ignoring BM25. Pure vector search misses exact keyword matches. A user searching for "error code E4012" needs keyword matching, not semantic similarity. Hybrid search catches both.
Chunking without overlap. Information that spans a chunk boundary is lost. Always include 10-15% overlap between consecutive chunks.
No source attribution. If users cannot verify where an answer came from, they cannot trust the system. Always return source references alongside answers.

Scaling RAG

Our SQLite-based system handles tens of thousands of vectors on a single machine. When you outgrow this:

10K-100K vectors: SQLite-vec with exact search. Sub-50ms queries. No infrastructure needed.
100K-1M vectors: pgvector or Qdrant with approximate nearest neighbor (ANN) indexing. Queries stay fast with HNSW or IVF indexes.
1M+ vectors: Managed services (Pinecone, Weaviate Cloud). Distributed search, automatic scaling, multi-region replication.

Do not over-architect for scale you do not have. Start with SQLite. Move to a managed database when (not before) you hit performance limits. Premature infrastructure is the most common waste in RAG projects — teams spend weeks setting up Kubernetes deployments for vector databases that hold 5,000 documents.

From RAG to Sovereign Memory

RAG gives Claude access to your documents. Persistent memory gives Claude the ability to learn and remember across sessions. The progression is natural: start with RAG for document retrieval, add memory for conversation history, add graph relationships for entity connections, and eventually build a system that gets smarter with every interaction.

This is the path we followed at Like One. Our brain started as a simple RAG system — embed documents, search, generate. It evolved into a sovereign memory architecture with 11 collections, hybrid search, graph relationships, and automatic embedding. Each layer built on the previous one. RAG is the foundation.

For the API fundamentals that RAG builds on, read our Claude API guide. For building agents that use RAG as a tool, see our Agent SDK tutorial. For connecting RAG to external data sources via MCP, start with our MCP server tutorial. And if you want to certify your Claude architecture skills including RAG, check our CCA exam prep guide.