
Production Fine-Tuning Patterns.

Battle-tested architectures for fine-tuning at scale in real production systems.

After this lesson you'll know

  • Five production patterns: router, cascade, ensemble, distillation, and multi-LoRA
  • How to combine fine-tuned models with RAG for maximum capability
  • Cost optimization strategies that cut inference bills by 60-80%
  • The complete production fine-tuning checklist

Pattern 1: The Router Pattern

Use a lightweight classifier to route queries to specialized fine-tuned models. Each model is an expert in its domain.

```
User Query → Router (small model or classifier)
    │
    ├─ Legal queries   → Legal LoRA adapter
    ├─ Medical queries → Medical LoRA adapter
    ├─ Code queries    → Code LoRA adapter
    └─ General queries → Base model (no adapter)
```

**Implementation:**

```python
from transformers import pipeline

# Router: lightweight classifier
router = pipeline(
    "text-classification",
    model="./query-router-model",  # fine-tuned BERT or similar
    device="cpu",                  # the router is tiny; it runs on CPU
)

# LoRA adapter paths, keyed by router label
ADAPTERS = {
    "legal": "./adapters/legal-lora",
    "medical": "./adapters/medical-lora",
    "code": "./adapters/code-lora",
}

# `model` is the PEFT-wrapped base model on GPU, defined elsewhere.
# Load every adapter once at startup so requests only switch, never reload.
for name, path in ADAPTERS.items():
    model.load_adapter(path, adapter_name=name)

async def route_and_respond(query):
    # Step 1: Classify the query
    category = router(query)[0]["label"]

    # Step 2: Activate the matching adapter, or fall back to the base model
    if category in ADAPTERS:
        model.enable_adapter_layers()  # re-enable if a prior request disabled them
        model.set_adapter(category)
    else:
        model.disable_adapter_layers()

    # Step 3: Generate the response
    return generate(model, query)
```

**Why this pattern works:**

- Each adapter is small (10-100MB) and can be hot-swapped
- The router adds <5ms of latency (negligible)
- Each domain gets a specialist model without multiplying GPU costs
- New domains are added by training new adapters, not new models
The router itself can be a fine-tuned model. Train a tiny classifier (e.g., BERT-tiny, roughly 4M parameters) on 1,000 labeled queries. It can reach 95%+ routing accuracy while adding essentially zero latency to the pipeline.
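A minimal sketch of that router training run, assuming a local `queries.csv` with `text` and `label` columns (the file name, label set, and hyperparameters here are illustrative, not from the lesson):

```python
# Hypothetical sketch: fine-tune a tiny encoder as the query router.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

LABELS = ["legal", "medical", "code", "general"]  # assumed label set

# Load the labeled queries and split off a small evaluation set
dataset = load_dataset("csv", data_files="queries.csv")["train"]
dataset = dataset.class_encode_column("label").train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=len(LABELS)
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./query-router-model",
        num_train_epochs=3,
        per_device_train_batch_size=32,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./query-router-model")
```

The saved checkpoint is what the `pipeline("text-classification", ...)` call above loads as the router.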

Pattern 2: The Cascade Pattern

Start with a cheap, fast model. Escalate to a larger, more expensive model only when the cheap model is uncertain.

```
User Query → Small Model (8B, fine-tuned)
    │
    ├─ Confident (>90% score) → Return response
    │
    └─ Uncertain (<90% score) → Large Model (70B)
                                    │
                                    └─ Return response
```

```python
def cascade_inference(query, small_model, large_model, threshold=0.9):
    """Use the cheap model first; escalate to the expensive model if uncertain."""
    # Step 1: Try the small model.
    # (`query` is assumed to be tokenized input IDs; tokenization omitted for brevity.)
    small_response = small_model.generate(
        query,
        output_scores=True,            # return per-token logits
        return_dict_in_generate=True,
    )

    # Step 2: Calculate confidence from the per-token scores
    logprobs = small_response.scores
    avg_confidence = compute_avg_token_probability(logprobs)

    # Step 3: Route based on confidence
    if avg_confidence > threshold:
        return {
            "response": decode(small_response),
            "model": "small",
            "confidence": avg_confidence,
            "cost": SMALL_MODEL_COST,
        }
    else:
        large_response = large_model.generate(query)
        return {
            "response": decode(large_response),
            "model": "large",
            "confidence": None,
            "cost": LARGE_MODEL_COST,
        }
```

**Economics of the cascade:**

```
Assumption: 70% of queries are "easy" (small model handles them)

Small model cost: $0.001 per query
Large model cost: $0.010 per query

Without cascade: 100% * $0.010 = $0.010 per query
With cascade:    70% * $0.001 + 30% * $0.010 = $0.0037 per query

Savings: 63% cost reduction with minimal quality loss
```
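The snippet above leaves `compute_avg_token_probability` undefined. Here is one plausible implementation, assuming `scores` is the tuple of per-step logit tensors that `generate(..., output_scores=True)` returns and that decoding is greedy (so the top probability at each step is the probability of the token actually chosen):

```python
import torch

def compute_avg_token_probability(scores):
    """Average per-token probability of the generated sequence.

    `scores` is a tuple with one tensor per generated token, each of shape
    (batch_size, vocab_size). Assumes greedy decoding, so the max probability
    at each step belongs to the emitted token.
    """
    step_probs = [
        torch.softmax(step_logits, dim=-1).max(dim=-1).values  # (batch_size,)
        for step_logits in scores
    ]
    return torch.stack(step_probs).mean().item()
```

Other confidence signals (mean token entropy, a trained verifier) also work here; whichever you choose, tune the escalation threshold on held-out traffic rather than fixing it at 0.9.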