Evaluation Metrics & Benchmarks

If you cannot measure it, you cannot improve it. Rigorous evaluation separates production models from experiments.

After this lesson you'll know:

  • Which metrics to use for different fine-tuning tasks (classification, generation, chat)
  • How to detect overfitting and catastrophic forgetting
  • LLM-as-judge evaluation for open-ended generation
  • Building automated evaluation pipelines that run after every training job

Metrics by Task Type

Different tasks require different evaluation metrics. Using the wrong metric gives you false confidence.

**Classification tasks:**

  • Accuracy: correct predictions / total predictions. Use when classes are balanced.
  • F1 score: harmonic mean of precision and recall. Use when classes are imbalanced.
  • Confusion matrix: shows exactly where the model confuses classes. Always inspect this, even when accuracy is high.
  • Per-class metrics: calculate precision/recall per class. A model with 95% overall accuracy might have 0% recall on your most important class.

```python
from sklearn.metrics import classification_report, confusion_matrix

# After generating predictions on the test set
y_true = ["lease", "nda", "lease", "employment", "nda"]
y_pred = ["lease", "nda", "employment", "employment", "nda"]

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=["lease", "nda", "employment"]))
```

**Text generation tasks:**

  • Perplexity: how "surprised" the model is by the test data. Lower is better. Good for comparing models on the same test set; not meaningful in isolation.
  • BLEU/ROUGE: n-gram overlap with reference outputs. Useful for translation and summarization; poor for creative or open-ended generation.
  • Exact match: does the output exactly match the expected output? Good for structured outputs (JSON, code, SQL).
  • Custom metrics: task-specific checks such as JSON validity rate (for structured output tasks), code execution pass rate (for code generation), and regex match rate (for format compliance). See the first sketch below.

**Open-ended generation (chat, creative writing):**

  • LLM-as-judge: use a stronger model (Claude, GPT-4) to rate outputs on specified criteria. The most reliable automated evaluation for subjective quality; a sketch follows below.
  • Human evaluation: the gold standard, but expensive and slow. Use it for final validation, not iterative development.
  • Win rate: compare fine-tuned and base model outputs side by side. What percentage of comparisons does the fine-tuned model win?
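To make the custom-metrics idea concrete, here is a minimal sketch of two automated checks: a JSON validity rate and an exact-match rate. The `outputs` and `references` lists are hypothetical placeholders standing in for your model's generations and your test set's expected answers.

```python
import json

def json_validity_rate(outputs: list[str]) -> float:
    """Fraction of outputs that parse as valid JSON."""
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs)

def exact_match_rate(outputs: list[str], references: list[str]) -> float:
    """Fraction of outputs that exactly match the reference after stripping whitespace."""
    matches = sum(o.strip() == r.strip() for o, r in zip(outputs, references))
    return matches / len(outputs)

# Hypothetical generations from a structured-output fine-tune
outputs = ['{"type": "lease"}', '{"type": "nda"', '{"type": "employment"}']
references = ['{"type": "lease"}', '{"type": "nda"}', '{"type": "employment"}']

print(f"JSON validity: {json_validity_rate(outputs):.2%}")            # 66.67%
print(f"Exact match:   {exact_match_rate(outputs, references):.2%}")  # 66.67%
```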
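An LLM-as-judge loop can be equally small. In the sketch below, `call_judge_model` is a hypothetical stand-in for whatever API client you use to reach a strong judge model (for example the Anthropic or OpenAI SDK), and the three-criterion rubric with a 1-5 scale is illustrative, not prescribed.

```python
import re

JUDGE_PROMPT = """Rate the following answer on a 1-5 scale for each criterion:
helpfulness, factual accuracy, and adherence to the requested format.

Question: {question}
Answer: {answer}

Respond with only three integers separated by commas, e.g. "4, 5, 3"."""

def call_judge_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real API call to a strong
    # judge model (e.g. Claude or GPT-4) via its official SDK.
    raise NotImplementedError

def judge_score(question: str, answer: str) -> float:
    """Average of the three criterion scores returned by the judge."""
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    scores = [int(s) for s in re.findall(r"\d+", reply)[:3]]
    return sum(scores) / len(scores)
```

Asking the judge for a constrained, machine-parseable reply keeps the pipeline automated; averaging scores over many test questions (and randomizing answer order when doing pairwise win-rate comparisons) helps reduce judge bias.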
Never use a single metric. Every fine-tuned model should be evaluated on at least three metrics: one automated task-specific metric, one general quality metric (perplexity or LLM-as-judge), and one overfitting detector (train vs validation loss gap).
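For the overfitting detector, one simple heuristic is to compare final training and validation losses and flag a large gap. A sketch follows; the 0.3 threshold is an assumed, illustrative value, not a universal constant, so calibrate it for your task.

```python
def overfitting_gap(train_loss: float, val_loss: float, threshold: float = 0.3) -> bool:
    """Return True if the train/validation loss gap suggests overfitting.

    The default threshold is an assumption for illustration; tune it per project.
    """
    gap = val_loss - train_loss
    if gap > threshold:
        print(f"WARNING: val loss exceeds train loss by {gap:.3f} - likely overfitting.")
        return True
    return False

# Hypothetical losses from the final epoch of a training run
overfitting_gap(train_loss=0.41, val_loss=0.92)
```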