
Multi-Model Routing & Fallbacks.

Orchestrating multiple models for cost, speed, and resilience.

After this lesson you'll know

  • How to build a unified model abstraction layer across providers
  • Routing strategies: rule-based, classifier-based, and cascading
  • Fallback chains that degrade gracefully without user impact
  • A/B testing and canary deployments for model changes

The Model Abstraction Layer

Coupling your application to a single model is a business risk. Models get deprecated, pricing changes, quality fluctuates, and outages happen. A model abstraction layer decouples your application logic from any specific provider.

```python
class ModelClient:
    """Unified interface across all providers."""

    def __init__(self):
        self.providers = {
            "anthropic": AnthropicAdapter(),
            "openai": OpenAIAdapter(),
            "local": OllamaAdapter(),
        }

    async def generate(self, prompt, model_id, **kwargs):
        # e.g., "anthropic/claude-sonnet-4-20250514" ->
        # provider="anthropic", model="claude-sonnet-4-20250514"
        provider, model = self.parse_model_id(model_id)
        adapter = self.providers[provider]
        return await adapter.generate(prompt, model, **kwargs)


class AnthropicAdapter:
    async def generate(self, prompt, model, **kwargs):
        response = await self.client.messages.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=kwargs.get("max_tokens", 1024),
        )
        return UnifiedResponse(
            text=response.content[0].text,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            model=model,
            provider="anthropic",
        )
```

Every adapter normalizes its provider's response format into a `UnifiedResponse`. Your application code never touches provider-specific APIs directly. Switching from Claude to GPT-4 becomes a configuration change, not a code change.
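The `UnifiedResponse` itself can be a plain dataclass. The sketch below is not part of the lesson's code; its fields are simply inferred from what the adapter above constructs.

```python
from dataclasses import dataclass


@dataclass
class UnifiedResponse:
    """Provider-agnostic response shape returned by every adapter."""
    text: str
    input_tokens: int
    output_tokens: int
    model: str
    provider: str
```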
Adapter parity: Not all models support the same features (tool use, vision, structured output, streaming). Your adapter layer should declare capabilities per model so the router can make informed decisions. Don't discover missing features at runtime.
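One way to declare capabilities is a static table the router consults before dispatch. The table structure, model entries, and `supports` helper below are an illustrative sketch, not part of the lesson's code.

```python
# Hypothetical capability registry; extend with whatever features your routing needs.
MODEL_CAPABILITIES = {
    "anthropic/claude-sonnet-4-20250514": {"tools": True, "vision": True, "streaming": True},
    "openai/gpt-4o-mini": {"tools": True, "vision": True, "streaming": True},
    "local/llama3": {"tools": False, "vision": False, "streaming": True},
}


def supports(model_id: str, feature: str) -> bool:
    """Let the router reject a model before dispatch instead of failing at runtime."""
    return MODEL_CAPABILITIES.get(model_id, {}).get(feature, False)
```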

Routing Strategies

Three approaches to deciding which model handles each request:

**Strategy 1: Rule-based routing.** Fastest, cheapest, and most transparent. Use it when task types are well-defined.

```python
ROUTING_RULES = {
    "classification": "openai/gpt-4o-mini",                    # Fast, cheap, good enough
    "summarization": "anthropic/claude-haiku",                 # Concise, fast
    "code_generation": "anthropic/claude-sonnet-4-20250514",   # Strong at code
    "creative_writing": "anthropic/claude-opus-4-20250514",    # Best quality
    "embedding": "openai/text-embedding-3-small",
    "translation": "openai/gpt-4o",                            # Strong multilingual
}
```

**Strategy 2: Classifier-based routing.** A small model or ML classifier evaluates each request and selects the appropriate model tier.

```python
class ClassifierRouter:
    async def route(self, request):
        # Score complexity with a lightweight model (~10ms, ~$0.00001)
        complexity = await self.classifier.score(request.text)
        if complexity < 0.3:
            return "openai/gpt-4o-mini"
        elif complexity < 0.7:
            return "anthropic/claude-sonnet-4-20250514"
        else:
            return "anthropic/claude-opus-4-20250514"
```

**Strategy 3: Cascading.** Start with the cheapest model. If its response doesn't pass quality checks, escalate to the next tier. This guarantees quality while minimizing cost.

```python
class CascadingRouter:
    CASCADE = [
        {"model": "openai/gpt-4o-mini", "quality_threshold": 0.8},
        {"model": "anthropic/claude-sonnet-4-20250514", "quality_threshold": 0.7},
        {"model": "anthropic/claude-opus-4-20250514", "quality_threshold": 0.0},  # Always accept
    ]

    async def generate(self, prompt):
        for tier in self.CASCADE:
            response = await self.client.generate(prompt, tier["model"])
            quality = await self.quality_scorer.score(prompt, response)
            if quality >= tier["quality_threshold"]:
                return response  # Good enough for this tier
        return response  # Final tier always returns
```
Cascading trade-off: Cascading guarantees quality but adds latency for hard requests (potentially 3x the latency if all tiers are tried). Use it for async tasks where latency is acceptable. For real-time chat, classifier-based routing is usually better.
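To make the dispatch concrete, here is a minimal sketch of rule-based routing feeding the `ModelClient` from the abstraction-layer section. The `route_request` helper, the `task_type` argument, and the default model choice are hypothetical, not part of the lesson's code.

```python
async def route_request(client: ModelClient, task_type: str, prompt: str):
    # Fall back to a general-purpose default when the task type has no rule.
    model_id = ROUTING_RULES.get(task_type, "anthropic/claude-sonnet-4-20250514")
    return await client.generate(prompt, model_id)

# Usage (illustrative):
# response = await route_request(client, "summarization", "Summarize this report: ...")
```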