RAG Mastery Quiz
Test your knowledge across all 9 lessons. This assessment covers embeddings, chunking, the RAG pipeline, prompt augmentation, hybrid search, evaluation, and advanced patterns. A score of 80% or higher means you are ready to build production RAG systems.
What You Have Learned
Over 9 lessons, you have built a complete understanding of Retrieval-Augmented Generation:
- Embeddings convert text to vectors. Vector databases store and search by meaning. Chunking splits documents into searchable pieces.
- The RAG loop: embed → search → retrieve → augment → generate. You built a complete system from scratch with real code.
- Prompt augmentation prevents hallucination. Hybrid search combines keyword precision with semantic understanding.
- Evaluation metrics measure quality. Advanced patterns (Multi-Step, Self-RAG, RAG+Tools, Agentic) handle complex queries.
Critical Concepts Review
Before you take the final assessment, revisit the core ideas from each module. These are the concepts that separate someone who has heard of RAG from someone who can build and maintain a production system.
Embeddings are the translation layer between human language and machine-searchable space. Every sentence, paragraph, or document becomes a fixed-length vector of floating-point numbers, and the geometric relationships between those vectors encode meaning.
- What they are: Dense numerical vectors (typically 384 to 1536 dimensions) produced by neural networks trained with contrastive learning. The model learns to place semantically similar text close together and dissimilar text far apart.
- Cosine similarity: The standard metric for comparing embeddings. It measures the angle between two vectors, not their magnitude. Two vectors pointing in the same direction score close to 1.0 regardless of length. This matters because embedding models can produce vectors of different magnitudes for texts of different lengths, and you want to compare meaning, not word count. (A code sketch follows this list.)
- Model selection: The embedding model you choose determines the quality ceiling for your entire RAG system. Key factors include dimension count (higher dimensions capture more nuance but cost more storage), training data domain (a model trained on scientific papers will outperform a general model for scientific RAG), and the critical rule: you must use the same model for documents and queries. Mismatched models produce vectors in different spaces that cannot be meaningfully compared.
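To make the cosine-similarity bullet concrete, here is a minimal sketch using NumPy. The toy vectors stand in for real embeddings; the point is that direction, not magnitude, drives the score.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity: 1.0 means same direction, near 0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude
print(cosine_similarity(a, b))  # prints 1.0; length is irrelevant
```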
A vector database is not just storage. It is the search engine that makes RAG fast enough for real-time applications. Understanding how it works under the hood helps you tune performance and debug retrieval failures.
- HNSW indexes: Hierarchical Navigable Small World graphs are the dominant indexing strategy. They build a multi-layer graph where the top layers have few, widely-spaced nodes for fast coarse navigation, and the bottom layers have dense connections for precise nearest-neighbor search. The result is sub-millisecond search across millions of vectors — the difference between a usable product and one that times out.
- Storage patterns: Vectors are stored alongside metadata (source document, chunk position, timestamps, tags). This metadata enables filtering before or after similarity search. A well-designed metadata schema is as important as the vectors themselves because it determines what kinds of filtered queries you can run efficiently.
- Similarity thresholds: Not every result above 0.0 is useful. Production systems set a minimum similarity threshold (commonly 0.7 to 0.85 for cosine similarity) below which results are discarded. Setting this threshold requires experimentation with your specific data — too high and you miss relevant results, too low and you inject noise into the LLM context.
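As a sketch of how thresholding fits in after a search (the result shape below is a generic stand-in, not any particular database client's API):

```python
MIN_SIMILARITY = 0.75  # tune on your own data; 0.7 to 0.85 is a common range

# Hypothetical search results: text, score, and metadata per hit.
results = [
    {"text": "Refunds are accepted within 30 days.", "score": 0.91,
     "metadata": {"source": "policy.pdf", "chunk": 12}},
    {"text": "Our office dog is named Waffles.", "score": 0.41,
     "metadata": {"source": "blog.html", "chunk": 3}},
]

# Keep only hits above the threshold; everything below it is noise to the LLM.
filtered = [r for r in results if r["score"] >= MIN_SIMILARITY]
```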
RAG is a five-stage pipeline, and understanding each stage is essential for debugging when things go wrong. When an answer is bad, the fix depends entirely on which stage failed.
- Embed: The user query is converted to a vector using the same embedding model used for documents. If this vector does not capture the query intent well, everything downstream fails.
- Search: The query vector is compared against the document vectors in the database (in practice via the approximate index, not a brute-force scan of every vector). The top-k most similar results are returned. The value of k (typically 3 to 10) balances context richness against noise and token cost.
- Retrieve: The actual text chunks corresponding to the top-k vectors are fetched along with their metadata. This is where filtering by source, date, or category can narrow results to the most relevant subset.
- Augment: Retrieved chunks are injected into a prompt template alongside the original question and grounding instructions. This is the most underestimated stage — the prompt design determines whether the LLM uses the context faithfully or ignores it.
- Generate: The LLM produces an answer grounded in the provided context. Temperature, system prompt, and citation requirements all influence output quality. A low temperature (0.1 to 0.3) reduces creative hallucination.
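Putting the five stages together, a skeleton of the loop might look like the sketch below. `embed_text`, `vector_db`, and `llm_complete` are hypothetical stand-ins for your embedding model, database client, and LLM call; the shape of the pipeline is the point, not the names.

```python
def answer(question: str, k: int = 5) -> str:
    query_vec = embed_text(question)              # 1. Embed (same model as the documents)
    hits = vector_db.search(query_vec, top_k=k)   # 2. Search for the top-k nearest vectors
    chunks = [h["text"] for h in hits]            # 3. Retrieve the matching text chunks
    prompt = (                                    # 4. Augment with grounding instructions
        "Answer ONLY from the context below. "
        "If it does not contain the answer, say so.\n\n"
        "Context:\n" + "\n---\n".join(chunks) +
        f"\n\nQuestion: {question}"
    )
    return llm_complete(prompt, temperature=0.2)  # 5. Generate a grounded answer
```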
Chunking is where information architecture meets retrieval quality. How you split documents determines what the system can find and how useful the retrieved context will be.
- Size tradeoffs: Small chunks (100-200 words) give precise retrieval — the matched text is tightly relevant — but may lack the surrounding context needed to understand the information. Large chunks (400-600 words) preserve context but may dilute relevance with tangential content. Most production systems settle between 200 and 400 words.
- Overlap strategy: Adjacent chunks should share 10-20% of their text. Overlap ensures that information split across a chunk boundary is not lost. Without overlap, a key sentence at the edge of a chunk might be separated from the context it needs, making it unretrievable or misleading when retrieved alone. (A splitter sketch follows this list.)
- Document-type-specific approaches: One chunking strategy does not fit all documents. Technical documentation benefits from section-based splitting that respects heading hierarchies. Conversational transcripts need speaker-turn-aware chunking. Legal documents require clause-level segmentation. Code should be chunked by function or class boundaries, not arbitrary line counts.
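A minimal word-based splitter with overlap could look like this sketch. Real systems usually respect sentence or section boundaries first, as described above; this shows only the overlap mechanic.

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    # Adjacent chunks share `overlap` words so no idea is cut cleanly in half.
    words = text.split()
    step = size - overlap  # advance by less than `size` to create the overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

chunks = chunk_words("some long document " * 500)  # 50/300, roughly 17% overlap
```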
The augmented prompt is where retrieval becomes generation. A poorly constructed prompt will waste even the best retrieval results. This is the most controllable and often most impactful lever in the entire system.
- Grounding instructions: Explicit instructions that tell the LLM to answer ONLY from the provided context. Without these, the model will freely mix retrieved facts with its training data, producing confident-sounding answers that blend truth with fabrication. Example: "Answer the question using ONLY the information in the context below. If the context does not contain enough information, say so." (A template sketch follows this list.)
- Citation requirements: Requiring the LLM to cite which chunk or source supports each claim forces it to ground every statement. This both reduces hallucination and gives users a way to verify answers. Citations also make evaluation much easier — you can automatically check whether cited sources actually support the claims made.
- Hallucination prevention: Beyond grounding instructions, effective strategies include lowering temperature (0.1-0.3), explicitly instructing the model to say "I don't know" when context is insufficient, separating retrieved context from the question with clear delimiters, and ordering chunks by relevance score so the most relevant information appears first in the context window.
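One way to assemble such a prompt, assuming chunks arrive pre-sorted by relevance score, highest first. The template wording is an example, not a canonical format.

```python
GROUNDED_TEMPLATE = """Answer the question using ONLY the information in the context below.
If the context does not contain enough information, say "I don't know."
Cite the [source] that supports each claim.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[dict]) -> str:
    # Clear delimiters between chunks, with a citable source label on each.
    context = "\n---\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return GROUNDED_TEMPLATE.format(context=context, question=question)
```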
Hybrid search is the recognition that neither keyword search nor vector search is universally superior. Each has blind spots that the other covers, and combining them produces the most robust retrieval.
- BM25 + vector: BM25 is a term-frequency-based algorithm that excels at exact-match retrieval — error codes, product SKUs, statute numbers, technical identifiers. Vector search excels at meaning-based retrieval — finding documents about "automobile maintenance" when the query says "car repair." Hybrid search runs both in parallel and merges results.
- Alpha weighting: The alpha parameter controls the blend. An alpha of 0.0 is pure keyword search, 1.0 is pure vector search, and 0.5 is an equal mix. The optimal alpha depends on your domain. Legal and medical domains with exact terminology often perform best at 0.3-0.4 (keyword-heavy). Creative or conversational domains benefit from 0.6-0.8 (semantic-heavy). Tuning alpha on your evaluation dataset is one of the highest-ROI optimizations available. (A fusion sketch follows this list.)
- When to use each: Use pure vector search when queries are natural language and the corpus uses varied vocabulary. Use pure keyword search when queries contain identifiers that must match exactly. Use hybrid (the default recommendation) when your query mix includes both types, or when you are unsure — hybrid rarely underperforms the better individual method by much, and often outperforms both.
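The fusion arithmetic itself is simple. Here is a sketch that assumes both score sets have already been normalized to a 0-1 range per query; normalization is the fiddly part in practice.

```python
def hybrid_scores(bm25: dict[str, float], vec: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    # alpha = 0.0 -> pure keyword; alpha = 1.0 -> pure vector.
    return {doc: (1 - alpha) * bm25.get(doc, 0.0) + alpha * vec.get(doc, 0.0)
            for doc in bm25.keys() | vec.keys()}

scores = hybrid_scores({"doc_a": 0.9, "doc_b": 0.2},
                       {"doc_b": 0.8, "doc_c": 0.7}, alpha=0.4)
ranked = sorted(scores, key=scores.get, reverse=True)  # doc_a, doc_b, doc_c
```

Many systems sidestep normalization by merging with reciprocal rank fusion instead, which combines ranks rather than raw scores.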
Common RAG Mistakes
These are the errors that trip up even experienced engineers. Each one is easy to make and hard to diagnose because the system still produces answers — just bad ones. Knowing these patterns lets you avoid them in your own builds and diagnose them quickly when reviewing others' systems.
Using one embedding model to index documents and a different model to embed queries. The two models produce vectors in different geometric spaces, so similarity scores become meaningless. The system still returns results — they are just random. This is the single most common and most damaging mistake because it silently degrades every query without any error message. Always verify that your indexing pipeline and query pipeline use the exact same model name and version.
Chunks too large (1000+ words) dilute the embedding with off-topic content, so the vector represents an average of many ideas instead of one clear concept. The system retrieves chunks that are only vaguely relevant. Chunks too small (under 50 words) lose all surrounding context, so retrieved text is technically relevant but useless for answering the question. The fix is empirical: test 3-4 chunk sizes on representative queries and measure retrieval precision. Most domains land between 200 and 400 words.
Injecting retrieved context into the prompt without telling the LLM to use only that context. The model defaults to blending retrieved facts with its training knowledge, producing answers that look well-grounded but contain fabricated details. Without explicit grounding instructions, you have built a system that looks like RAG but behaves like a vanilla LLM with extra tokens in its prompt. Always include a directive like "Answer ONLY from the provided context."
Deploying a RAG system without measuring faithfulness, relevance, and completeness. Without metrics, you have no way to know if a change to chunking, prompts, or retrieval parameters made things better or worse. Every decision becomes guesswork. Set up an evaluation pipeline with at least 20-30 representative questions and ground-truth answers before you start tuning. Measure faithfulness first — a system that hallucinates is worse than one that says "I don't know."
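A sketch of that minimum viable evaluation loop, where `rag_answer` and `judge_faithful` are hypothetical placeholders (the latter is typically an LLM-as-judge call that checks whether every claim in the answer is supported by the retrieved chunks):

```python
test_set = [
    {"question": "What is the refund window?", "ground_truth": "30 days"},
    # ... 20-30 representative questions with ground-truth answers
]

def faithfulness_rate(cases: list[dict]) -> float:
    supported = 0
    for case in cases:
        answer, chunks = rag_answer(case["question"])  # your RAG pipeline
        if judge_faithful(answer, chunks):             # every claim backed by chunks?
            supported += 1
    return supported / len(cases)  # aim for above 0.9 before shipping
```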
Relying solely on vector search when the domain includes identifiers, codes, or exact terminology. Vector search understands that "HTTP 404" and "page not found" are related, but it cannot guarantee an exact match on the string "ERR-4821-X" because the embedding might map it near similar-looking codes. Any domain with error codes, product IDs, legal references, medical codes, or technical identifiers needs hybrid search with meaningful keyword weight. This is not optional — it is a correctness requirement.
Splitting documents into perfectly adjacent, non-overlapping chunks. This creates hard boundaries where a key sentence at the end of chunk N and the beginning of chunk N+1 are separated. Neither chunk alone captures the full idea, and neither will be retrieved for a query about that idea. Adding 10-20% overlap between adjacent chunks ensures boundary information appears in at least one chunk with enough surrounding context to be useful. This is a one-line configuration change with outsized impact on retrieval quality.
Production RAG Checklist
Before deploying any RAG system to production, walk through this checklist. Each item addresses a failure mode that has caused real production incidents. A system that passes all of these checks is ready for users. One that fails any of them is shipping with a ticking time bomb.
Verify that the exact same embedding model name AND version is used in the indexing pipeline and the query pipeline. Pin the model version in your configuration — do not use "latest." Document the model choice and the date it was set. If you ever need to change models, you must re-embed every document in your corpus. There are no shortcuts.
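One cheap way to enforce this is to have both pipelines read the model name from a single pinned constant and fail fast on mismatch. The file layout and the `describe_index` call below are illustrative, not a specific library's API.

```python
# config.py (imported by BOTH the indexing job and the query service)
EMBEDDING_MODEL = "all-MiniLM-L6-v2"      # pinned exact model, never "latest"
EMBEDDING_MODEL_PINNED_ON = "2024-05-01"  # document when the choice was made

# query_service.py: refuse to start if the index was built with another model
from config import EMBEDDING_MODEL

index_model = vector_db.describe_index()["embedding_model"]  # hypothetical API
assert index_model == EMBEDDING_MODEL, "query/index embedding models differ"
```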
Confirm that chunk overlap is set to 10-20% of chunk size. Test with a query whose answer spans a chunk boundary in your corpus — if the system cannot retrieve it, your overlap is insufficient or missing. Log the chunk size and overlap values in your system configuration for future reference.
Run your evaluation suite and confirm that the faithfulness (groundedness) metric is above 0.9 across your test set. If it is below 0.9, your system is hallucinating on more than 10% of answers — unacceptable for production. Diagnose whether the issue is in the prompt (missing grounding instructions), the retrieval (wrong chunks), or the generation (temperature too high). Do not ship until this passes.
If your domain includes any identifiers, codes, technical terms, or proper nouns that must match exactly, enable hybrid search and tune the alpha parameter on your evaluation set. Test with queries containing exact identifiers and confirm they return the correct documents. If your domain is purely conversational, pure vector search is acceptable, but document the decision.
Verify that every prompt template used in your system contains explicit grounding instructions telling the LLM to answer only from the provided context. Include a fallback instruction for when context is insufficient ("If the provided context does not contain enough information to answer, say so"). Test this by asking a question your corpus cannot answer and confirming the system does not fabricate a response.
Configure a minimum similarity threshold so that low-relevance results are not passed to the LLM. Test by querying something completely outside your corpus and confirming the system returns no results (or a "no relevant information found" response) rather than injecting irrelevant chunks. A threshold between 0.7 and 0.85 is typical, but tune it for your data.
Every production RAG query should log: the original question, the retrieved chunk IDs and similarity scores, the assembled prompt (or a hash of it), the generated answer, and the latency of each stage. Without these logs, you cannot diagnose failures, measure drift over time, or identify queries that consistently produce poor results. Set up alerts for queries that return zero results above threshold and for latency spikes in the embedding or retrieval stages.
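A per-query record along these lines covers those fields; the names and shapes are illustrative, not a standard schema.

```python
import hashlib
import json
import logging
import time

logger = logging.getLogger("rag")

def log_query(question: str, hits: list[dict], prompt: str,
              answer: str, latency_ms: dict) -> None:
    # One structured JSON record per query, for debugging and drift analysis.
    logger.info(json.dumps({
        "ts": time.time(),
        "question": question,
        "chunk_ids": [h["id"] for h in hits],
        "scores": [h["score"] for h in hits],
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "answer": answer,
        "latency_ms": latency_ms,  # e.g. {"embed": 12, "search": 8, "generate": 950}
    }))
```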
Your evaluation suite should run automatically on every change to chunking parameters, prompt templates, embedding models, or retrieval configuration. A regression in faithfulness or relevance should block deployment. Treat RAG evaluation the same way you treat unit tests in a CI/CD pipeline — the system should not reach production without passing.