After this lesson you'll know:
- How to format training data for instruction fine-tuning (chat and completion formats)
- Quality filtering techniques that eliminate noisy examples
- How to synthetically generate training data using stronger models
- Dataset sizing: how many examples you actually need
Data Formats
Match your training format exactly to your inference format. If you will use a system prompt in production, include system prompts in training. If you will not use multi-turn conversations, do not train on multi-turn data. Format mismatch is a top-3 cause of poor fine-tuning results.
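As a rough illustration of the two formats and of the matching rule above, the sketch below builds one chat-format record and one completion-format record, then checks a chat example against an assumed production setup. The key names (`messages`, `prompt`, `completion`), the ticket text, and the helper `format_mismatches` are illustrative assumptions; your training provider may expect different field names.

```python
import json

# Chat format: a "messages" array with system/user/assistant roles.
chat_example = {
    "messages": [
        {"role": "system", "content": "You classify support tickets."},
        {"role": "user", "content": "My card was charged twice."},
        {"role": "assistant", "content": "billing"},
    ]
}

# Completion format: a prompt/completion pair (key names are an assumption).
completion_example = {
    "prompt": "Ticket: My card was charged twice.\nCategory:",
    "completion": " billing",
}


def format_mismatches(example, uses_system_prompt, multi_turn):
    """Hypothetical helper: list the ways a chat example differs from
    how the model will be called in production."""
    roles = [m["role"] for m in example["messages"]]
    problems = []
    # Production uses a system prompt, so training examples should too.
    if uses_system_prompt and "system" not in roles:
        problems.append("production uses a system prompt but the example has none")
    # Production is single-turn, so multi-turn examples do not belong.
    if not multi_turn and roles.count("user") > 1:
        problems.append("production is single-turn but the example is multi-turn")
    return problems


# One JSON object per line (JSONL) is a common on-disk layout for such records.
print(json.dumps(chat_example))
print(json.dumps(completion_example))
print(format_mismatches(chat_example, uses_system_prompt=True, multi_turn=False))
```

Running a check like this over the whole dataset before training is a cheap way to catch format drift early.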
Quality Filtering Pipeline
Raw data is noisy. A single bad example can teach the model an unwanted pattern that takes 10 good examples to override. Filter aggressively.

**Level 1 - Format validation:**

```python
import json  # also needed by the generation code later in this lesson


def validate_example(example):
    """Reject malformed examples. Returns a list of errors (empty if valid)."""
    errors = []

    # Must have messages array
    if "messages" not in example:
        errors.append("Missing 'messages' key")
        return errors

    messages = example["messages"]

    # Must have at least user + assistant
    roles = [m.get("role") for m in messages]
    if "user" not in roles:
        errors.append("Missing user message")
    if "assistant" not in roles:
        errors.append("Missing assistant message")

    # No empty content (a missing "content" key counts as empty)
    for m in messages:
        if not (m.get("content") or "").strip():
            errors.append(f"Empty content in {m.get('role')} message")

    # Assistant response length check, in characters.
    # Tune these thresholds per task: single-label classification
    # outputs are legitimately shorter than 10 characters.
    assistant_msgs = [m for m in messages if m.get("role") == "assistant"]
    for m in assistant_msgs:
        content = m.get("content") or ""
        if len(content) < 10:
            errors.append("Assistant response suspiciously short")
        if len(content) > 4000:
            errors.append("Assistant response suspiciously long")

    return errors
```

**Level 2 - Content quality scoring:**

```python
def quality_score(example):
    """Score 0-1 based on content quality signals (first assistant message only)."""
    assistant_msgs = [m["content"] for m in example["messages"]
                      if m["role"] == "assistant"]
    if not assistant_msgs:
        return 0.0  # nothing to score
    assistant_msg = assistant_msgs[0]
    score = 1.0

    # Penalize repetitive content
    words = assistant_msg.split()
    unique_ratio = len(set(words)) / max(len(words), 1)
    if unique_ratio < 0.4:
        score -= 0.3

    # Penalize very short responses
    if len(words) < 5:
        score -= 0.2

    # Penalize responses that look like errors or refusals.
    # This is a crude case-insensitive substring match, and each
    # matching pattern subtracts 0.4, so expect some false positives.
    error_patterns = ["I cannot", "I'm sorry", "As an AI",
                      "I don't have", "ERROR", "undefined"]
    for pattern in error_patterns:
        if pattern.lower() in assistant_msg.lower():
            score -= 0.4

    return max(score, 0)


# Filter: keep examples with score > 0.7
# (dataset is your list of examples that passed Level 1)
dataset = [ex for ex in dataset if quality_score(ex) > 0.7]
```

**Level 3 - Deduplication:**

```python
from datasketch import MinHash, MinHashLSH


def deduplicate(dataset, threshold=0.8):
    """Remove near-duplicate examples using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []

    for i, ex in enumerate(dataset):
        text = str(ex["messages"])
        m = MinHash(num_perm=128)
        for word in text.split():
            m.update(word.encode('utf8'))

        # Keep the example only if nothing similar has been indexed yet
        if not lsh.query(m):
            lsh.insert(str(i), m)
            unique.append(ex)

    print(f"Deduplicated: {len(dataset)} -> {len(unique)}")
    return unique
```
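To build intuition for what `threshold=0.8` means in the deduplication step: MinHash estimates the Jaccard similarity between two sets, here the sets of whitespace-separated words in each example. The sketch below computes that similarity exactly for three made-up sentences using only the standard library. It is not how LSH works internally, only the quantity it approximates; because the LSH lookup is approximate, pairs close to the threshold can fall on either side.

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Exact Jaccard similarity between the word sets of two texts."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Illustrative sentences, not taken from any real dataset.
a = "how do i reset the password for my online account today"
b = "how do i reset the password for my online account now"
c = "where can i download my latest invoice"

# a and b differ by one word: 10 shared words out of 12 distinct words.
print(f"a vs b: {jaccard(a, b):.2f}")  # above 0.8, would count as near-duplicates
# a and c share only "i" and "my".
print(f"a vs c: {jaccard(a, c):.2f}")  # far below 0.8, both would be kept
```

Lowering the threshold removes looser paraphrases as well; raising it keeps everything except near-verbatim copies.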
The quality hierarchy: 100 perfect examples beat 1,000 mediocre examples. Always prioritize quality over quantity. If you must choose between more data and better data, choose better data every time.
Synthetic Data Generation
When you lack real training data, use a stronger model to generate synthetic examples. This is a common strategy for bootstrapping fine-tuning datasets.

**The teacher-student pattern:**

```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def generate_training_example(task_description, seed_input):
    """Use Claude to generate a training example."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Generate a training example for this task:

Task: {task_description}
Input: {seed_input}

Respond with a JSON object containing:
- "user_message": the user's query
- "assistant_message": the ideal response

The response should be concise, accurate, and match the exact format
a fine-tuned model should produce. Only output the JSON, nothing else.""",
        }],
    )
    # json.loads raises if the model wraps the JSON in any extra text
    return json.loads(response.content[0].text)


def format_as_chat(example):
    """Assumed helper: wrap the generated pair in the chat training format."""
    return {"messages": [
        {"role": "user", "content": example["user_message"]},
        {"role": "assistant", "content": example["assistant_message"]},
    ]}


# Generate 500 examples from 50 seed inputs
seed_inputs = load_seed_inputs()  # Your real data seeds (supply your own loader)
dataset = []
for seed in seed_inputs:
    for variation in range(10):  # 10 variations per seed
        example = generate_training_example(
            "Classify customer support tickets into categories",
            f"{seed} (variation {variation}: rephrase differently)",
        )
        dataset.append(format_as_chat(example))
```

**Quality control for synthetic data:**

1. Generate 2x more examples than you need
2. Score each with the quality pipeline above
3. Manually review a random 10% sample
4. Remove any examples that feel "off"; trust your domain expertise
5. Final dataset should be 500-2,000 high-quality examples

Dataset Sizing Guide
How much data do you need? It depends on the complexity of the behavior change:

```
TASK TYPE                   MINIMUM EXAMPLES   RECOMMENDED
Simple classification       50-100             200-500
Format/style change         100-300            500-1,000
Complex task (multi-step)   500-1,000          2,000-5,000
Domain-specific behavior    1,000-2,000        5,000-10,000
Full persona/assistant      2,000-5,000        10,000+
```

**Diminishing returns curve:**

```
Examples │ Quality
50       │ ████░░░░░░ (baseline improvement)
200      │ ██████░░░░ (significant jump)
500      │ ████████░░ (strong performance)
1,000    │ █████████░ (diminishing returns begin)
5,000    │ ██████████ (marginal gains from here)
10,000+  │ ██████████ (usually not worth the cost)
```

**Dataset split:**

```
Training set:   80-90% of data
Validation set: 10-15% of data (monitor for overfitting)
Test set:       5-10% held completely out (final evaluation)

CRITICAL: Never include test set examples in training.
Contamination produces artificially inflated metrics.
```

Exercise: Build a Training Dataset
1. Choose a specific task (classification, format conversion, or style matching).
2. Create 20 examples manually; these are your gold-standard seed data.
3. Use the synthetic generation pipeline to expand to 200 examples.
4. Run the quality filtering pipeline.
5. Check your deduplication rate and average quality score.
6. Split into train (160) / validation (20) / test (20).

Quiz
1. What is the most important data quality principle for fine-tuning?
2. What is the teacher-student pattern in synthetic data generation?
Vocabulary
What are the two primary training data formats for fine-tuning?
1. Chat format (messages array with system/user/assistant roles) for instruction tuning, 2. Completion format (prompt/completion pairs) for text generation
What are the three levels of quality filtering?
Level 1: Format validation (structure, required fields, length). Level 2: Content quality scoring (repetition, errors, refusals). Level 3: Deduplication (MinHash LSH to remove near-duplicates).
How many training examples do you need for a simple classification task?
Minimum 50-100, recommended 200-500. Diminishing returns begin around 1,000 examples for most classification tasks.
What is the recommended dataset split ratio?
Training: 80-90%. Validation: 10-15% (monitor overfitting). Test: 5-10% (held completely out for final evaluation). Never contaminate test data.
Why must training format match inference format exactly?
Format mismatch is a top-3 cause of poor fine-tuning results. If you use system prompts in production, include them in training. If you do not use multi-turn, do not train on multi-turn data.