
Evaluation Metrics

If you can't measure it, you can't improve it. Learn the three critical dimensions of RAG quality and how to score them systematically.

The RAG quality triangle: A good RAG answer must be (1) Relevant — the retrieved context actually relates to the question, (2) Faithful — the answer contains only claims supported by that context, and (3) Complete — the answer covers all the important information in the context.
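As a minimal sketch, the three sides of the triangle can be carried around as a single record per question-answer pair. The field names and the equal-weight average below are illustrative choices, not part of any particular framework:

```python
from dataclasses import dataclass

@dataclass
class RagScores:
    """One score per side of the RAG quality triangle, each on a 1-5 scale."""
    relevance: int     # does the retrieved context relate to the question?
    faithfulness: int  # is every claim in the answer supported by the context?
    completeness: int  # does the answer cover the important context?

    def overall(self) -> float:
        # Illustrative aggregate: a plain unweighted average of the three sides.
        return (self.relevance + self.faithfulness + self.completeness) / 3

scores = RagScores(relevance=5, faithfulness=4, completeness=3)
print(round(scores.overall(), 2))  # → 4.0
```

In practice you would tally these per-pair records across a whole evaluation set and track each dimension separately, since a high average can hide a systematic faithfulness problem.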
Automated evaluation: In production, you use an LLM-as-a-judge to score these metrics automatically: send the question, the retrieved context, and the answer to a strong judge model (e.g., GPT-4) and ask it to rate relevance, faithfulness, and completeness on a 1-5 scale. This lets you evaluate thousands of question-answer pairs without manual review.
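A minimal sketch of the judge round trip: building the prompt and parsing the three scores out of the reply. The prompt wording, the reply format, and the helper names are illustrative assumptions; the actual API call to the judge model is replaced here by a canned reply so the parsing is demonstrable:

```python
import re

# Illustrative judge prompt; a production rubric would be more detailed.
JUDGE_PROMPT = """\
You are grading a RAG system's output.

Question: {question}
Retrieved context: {context}
Answer: {answer}

Rate each dimension on a 1-5 scale, one per line, exactly as:
relevance: <score>
faithfulness: <score>
completeness: <score>
"""

def build_judge_prompt(question: str, context: str, answer: str) -> str:
    return JUDGE_PROMPT.format(question=question, context=context, answer=answer)

def parse_judge_scores(reply: str) -> dict[str, int]:
    """Pull the three 1-5 scores out of the judge model's reply."""
    scores = {}
    for metric in ("relevance", "faithfulness", "completeness"):
        match = re.search(rf"{metric}:\s*([1-5])", reply, re.IGNORECASE)
        if not match:
            raise ValueError(f"judge reply missing a score for {metric!r}")
        scores[metric] = int(match.group(1))
    return scores

# In production `reply` would come from an API call to the judge model.
reply = "relevance: 5\nfaithfulness: 4\ncompleteness: 3"
print(parse_judge_scores(reply))  # → {'relevance': 5, 'faithfulness': 4, 'completeness': 3}
```

Constraining the judge to a fixed reply format matters: a free-form critique is hard to aggregate, while a parseable score line lets you compute metrics over thousands of pairs mechanically.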

Evaluation Frameworks

RAGAS

Open-source framework for RAG evaluation. Measures faithfulness, answer relevancy, context precision, and context recall. One of the most widely used automated RAG evaluation tools.

DeepEval

LLM evaluation framework with RAG-specific metrics: hallucination, answer relevancy, contextual precision/recall. Integrates with CI/CD pipelines.

TruLens

Evaluation and tracking for LLM apps. Provides the "RAG Triad" of metrics: answer relevance, context relevance, and groundedness.

Custom LLM Judge

Build your own evaluator by prompting GPT-4: "Rate this answer's faithfulness to the context on 1-5. Explain." Simple, flexible, and domain-adaptable.
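A sketch of such a custom judge, with the model behind a plain callable so any provider (or, as here, a stub) can be plugged in. The prompt wording, reply format, and `judge_faithfulness` helper are illustrative assumptions, not an established API:

```python
import re
from typing import Callable

# Illustrative prompt following the pattern above: score first, then explain.
FAITHFULNESS_PROMPT = (
    "Rate this answer's faithfulness to the context on a 1-5 scale.\n"
    "Reply with 'Score: <1-5>' on the first line, then a one-sentence explanation.\n\n"
    "Context: {context}\nAnswer: {answer}"
)

def judge_faithfulness(context: str, answer: str,
                       model: Callable[[str], str]) -> tuple[int, str]:
    """Send the judge prompt to `model` (any prompt -> reply callable)
    and parse out the numeric score and the explanation."""
    reply = model(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    match = re.match(r"Score:\s*([1-5])\s*(.*)", reply, re.DOTALL)
    if not match:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1)), match.group(2).strip()

# Stub standing in for a real API call, purely to demonstrate the round trip:
def stub_model(prompt: str) -> str:
    return "Score: 2\nThe answer cites a figure that never appears in the context."

score, explanation = judge_faithfulness(
    "Paris is the capital of France.", "Paris has 12M people.", stub_model)
print(score, explanation)
```

Keeping the explanation alongside the score is what makes this approach domain-adaptable: you can audit why the judge penalized an answer and refine the prompt for your domain.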

