The RAG Loop
Follow a query through the complete RAG pipeline — from user question to AI-generated answer grounded in your data. This is the heartbeat of every RAG system: embed, search, retrieve, augment, generate. Understanding each step is essential to building systems that give accurate answers instead of confident hallucinations.
RAG in One Sentence
Instead of hoping the LLM memorized the answer during training, we find the relevant documents and hand them to the LLM along with the question. The model answers using actual data, not guesswork. This is Retrieval-Augmented Generation — and it is the difference between an AI that says "I think the refund policy is 30 days" (hallucination) and one that says "According to your documentation, the refund window is 14 days from purchase" (grounded answer).
The Six Steps
Every RAG system — from a prototype to an enterprise deployment — follows these six steps in the same order:
Step 1: Query. A natural language question enters the system. "What is the refund policy for international orders?" This is the starting point of every RAG loop. The quality of the query directly affects retrieval quality — vague queries get vague results.
Step 2: Embed. The question is converted to a vector using the same embedding model that processed the documents. This is critical — the query vector and document vectors must live in the same semantic space for similarity scores to be meaningful. Different models = different spaces = meaningless comparisons.
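For reference, a minimal sketch of this step, assuming the OPENAI_API_KEY environment variable is set:

from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    input="What is the refund policy for international orders?",
    model="text-embedding-3-small"  # must match the model used to embed the documents
)
query_vector = response.data[0].embedding
print(len(query_vector))  # 1536 dimensions for text-embedding-3-small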
Step 3: Search. The query vector is compared against all stored document vectors using cosine similarity. The HNSW index makes this fast — milliseconds even across millions of chunks. The database returns candidates ranked by how close they are to the query in semantic space.
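The comparison itself is plain cosine similarity. Here is a pure-Python sketch of the math — note that a real HNSW index approximates this search rather than scanning every vector:

import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the magnitudes.
    1.0 = same direction, 0.0 = unrelated, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)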
Step 4: Retrieve. The top-K most similar chunks are fetched — typically 3 to 5. Each chunk comes with its text content, similarity score, and metadata (source document, section, date). The similarity threshold (usually 0.7-0.85) filters out low-relevance noise.
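If your search function returns raw candidates instead of applying the threshold server-side, the filter is a one-liner. A sketch, assuming a hypothetical candidates list of dicts that each carry a similarity score (field names vary by setup):

top_k, threshold = 5, 0.75
relevant = [c for c in candidates if c["similarity"] >= threshold]  # drop low-relevance noise
chunks = sorted(relevant, key=lambda c: c["similarity"], reverse=True)[:top_k]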
Step 5: Augment. The retrieved chunks are inserted into a prompt template alongside the original question. The template tells the LLM: "Here is context from our documentation. Answer based ONLY on this context." This is the "A" in RAG — and it is what prevents hallucination.
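One common shape for such a template — the exact wording is up to you, and the full pipeline below uses a variant of it:

PROMPT_TEMPLATE = """Here is context from our documentation.
Answer based ONLY on this context. If the context does not
contain the answer, say you don't have that information.

Context:
{context}

Question: {question}"""

# context and question come from the earlier steps
prompt = PROMPT_TEMPLATE.format(context=context, question=question)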
Step 6: Generate. The model generates an answer grounded in the retrieved context, not in potentially outdated training data. With temperature set low (0.0-0.2), the model sticks closely to the context, producing reliable, verifiable answers.
The Complete Pipeline in Code
Here is a full RAG loop implementation using OpenAI embeddings, Supabase pgvector, and Claude for generation. It assumes a match_documents SQL function — the standard pgvector similarity-search pattern — already exists in your database, and that credentials are provided via environment variables:
import os

from openai import OpenAI
from supabase import create_client
import anthropic

# Credentials are read from the environment
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_KEY"]

openai_client = OpenAI()
supabase = create_client(SUPABASE_URL, SUPABASE_KEY)
claude = anthropic.Anthropic()

def rag_query(question, top_k=5, threshold=0.75):
    """Complete RAG pipeline: embed → search → retrieve → augment → generate."""
    # Step 1: the user question is already our input

    # Step 2: embed the query with the same model used for the documents
    query_embedding = openai_client.embeddings.create(
        input=question,
        model="text-embedding-3-small"
    ).data[0].embedding

    # Steps 3-4: vector search + retrieve the top-k chunks above the threshold
    result = supabase.rpc("match_documents", {
        "query_embedding": query_embedding,
        "match_threshold": threshold,
        "match_count": top_k
    }).execute()
    chunks = result.data

    if not chunks:
        return "I don't have enough information to answer that question."

    # Step 5: augment the prompt with the retrieved context
    context = "\n\n---\n\n".join(
        f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['content']}"
        for c in chunks
    )

    # Step 6: generate a grounded answer (low temperature keeps it close to the context)
    response = claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        temperature=0.0,
        system="Answer based ONLY on the provided context. If the context "
               "doesn't contain the answer, say 'I don't have that information.' "
               "Cite the source for each claim.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

# Use it
answer = rag_query("What is the refund policy for international orders?")
print(answer)
This roughly 40-line function is a complete, working RAG pipeline. Every enterprise RAG system, no matter how complex, is built on this same pattern.