
Building Local RAG Systems

Retrieval-Augmented Generation -- give your local AI knowledge of your private documents without retraining the model.

After this lesson you'll know:

  • How RAG works and why it's better than fine-tuning for most use cases
  • Building a complete local RAG pipeline from documents to answers
  • Prompt engineering for RAG -- reducing hallucination and improving accuracy
  • Evaluating and improving RAG quality over time

What Is RAG?

Retrieval-Augmented Generation combines two capabilities: retrieval (finding relevant documents from your collection) and generation (using an LLM to answer questions based on those documents). Instead of the model relying only on its training data, it references your actual documents to generate grounded answers.

Think of it as giving the AI an open-book exam. The model doesn't need to memorize your company's policies, financial data, or client records. It looks them up when asked, then formulates an answer based on what it found.

RAG vs. fine-tuning: Fine-tuning permanently alters a model's weights using your data. RAG leaves the model unchanged and retrieves context at query time. For most use cases -- especially with private, frequently updated data -- RAG is superior: faster to set up, easier to update, and the data source is transparent and auditable.

The RAG Pipeline

A local RAG system has five components. You built the first three in the previous lesson (a quick refresher sketch follows the list):

  1. Document ingestion: Load and chunk your documents
  2. Embedding: Convert chunks to vectors with a local model
  3. Vector storage: Store in ChromaDB or similar
  4. Retrieval: Find the most relevant chunks for a given question
  5. Generation: Pass the retrieved chunks to an LLM as context, generate an answer
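
As a quick refresher, here is a minimal sketch of steps 1-3, ingesting one document into the store that the query code below reads from. The fixed-size chunking and the handbook.txt filename are illustrative choices, not requirements:

import chromadb
import requests

# Naive fixed-size chunking -- production pipelines usually split on
# paragraph or sentence boundaries with some overlap between chunks.
def chunk(text, size=800):
    return [text[i:i+size] for i in range(0, len(text), size)]

def embed(text):
    r = requests.post("http://localhost:11434/api/embed",
        json={"model": "nomic-embed-text", "input": text})
    return r.json()["embeddings"][0]

client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_or_create_collection("documents")

# Ingest: chunk the document, embed each chunk, store with a stable ID.
text = open("handbook.txt").read()  # illustrative filename
for i, piece in enumerate(chunk(text)):
    collection.add(ids=[f"handbook-{i}"],
        embeddings=[embed(piece)],
        documents=[piece])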

Complete Local RAG System

import chromadb
import requests

# Embed text with the local nomic-embed-text model via Ollama.
# /api/embed returns one vector per input; we send a single string,
# so take the first (and only) vector.
def embed(text):
    r = requests.post("http://localhost:11434/api/embed",
        json={"model": "nomic-embed-text", "input": text})
    return r.json()["embeddings"][0]

# Build a grounded prompt from the retrieved chunks and ask the LLM.
# Instructing the model to answer ONLY from the context, and to admit
# when the context is insufficient, is the core anti-hallucination move.
def ask_llm(question, context_chunks):
    context = "\n\n---\n\n".join(context_chunks)
    prompt = f"""Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have
enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""
    r = requests.post("http://localhost:11434/api/generate",
        json={"model": "qwen2.5:14b", "prompt": prompt,
              "stream": False})
    return r.json()["response"]

# Setup (assuming documents already ingested)
client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_collection("documents")

# RAG query: embed the question, retrieve the 4 most similar chunks,
# then generate an answer grounded in those chunks.
question = "What was our policy on remote work last quarter?"
results = collection.query(
    query_embeddings=[embed(question)], n_results=4)

answer = ask_llm(question, results["documents"][0])
print(answer)
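
Before tuning anything, it's worth checking what the retriever actually returned. Chroma's query results include distance scores alongside the documents (lower means more similar), so a quick inspection loop -- a debugging sketch reusing the results variable from above -- shows whether the top chunks are actually relevant:

# Print each retrieved chunk with its distance score.
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"{dist:.3f}  {doc[:80]}")

If the nearest chunks are off-topic or the distances are large, the problem is retrieval (chunking, embedding model, or n_results), not generation -- fix that before rewriting the prompt.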