# Reliability & Fault Tolerance
Building AI systems that survive the chaos of production.
After this lesson you'll know:
- How to implement retries with exponential backoff and jitter
- Circuit breaker patterns for AI API calls
- Fallback strategies: model degradation, cached responses, graceful failure
- Timeout budgets and how to prevent cascading failures
## The Reliability Imperative
AI APIs fail. OpenAI has outages. Anthropic rate-limits you. Your vector database hiccups during peak load. Network partitions happen. The question is never "will my system fail?" -- it's "what happens when it does?"

Production AI systems face unique reliability challenges that traditional web services don't:

- **Non-deterministic outputs**: The same input can produce different outputs, making retries semantically complex.
- **Long-running calls**: A single model call can take 5-30 seconds, making timeout management critical.
- **Cascading token costs**: Each retry costs money. Naive retry logic on expensive models can blow your budget in minutes.
- **Provider dependencies**: You're relying on third-party APIs with their own SLAs (or lack thereof).
Industry data: OpenAI's API has historically averaged 99.5-99.8% uptime. That sounds high until you do the math: 99.5% uptime means ~50 minutes of downtime per week. If you serve 1,000 requests per hour, that's 800+ failed requests weekly.
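The arithmetic behind those numbers is worth internalizing. A quick back-of-the-envelope sketch (the function names here are just for illustration):

```python
# Back-of-the-envelope downtime math for a given uptime SLA.

def weekly_downtime_minutes(uptime: float) -> float:
    minutes_per_week = 7 * 24 * 60  # 10,080 minutes in a week
    return (1 - uptime) * minutes_per_week

def failed_requests_per_week(uptime: float, requests_per_hour: int) -> float:
    # Assumes requests arrive uniformly and every request during downtime fails.
    return (1 - uptime) * requests_per_hour * 24 * 7

print(weekly_downtime_minutes(0.995))         # ~50.4 minutes
print(failed_requests_per_week(0.995, 1000))  # ~840 requests
```

The uniform-traffic assumption is optimistic: if outages correlate with your peak hours, the real failure count is higher.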
## Retries Done Right
Naive retries (just try again immediately) make everything worse: they amplify load on an already struggling service, increase your costs, and can trigger rate limits that cascade into longer outages.

```python
import asyncio
import logging
import random

logger = logging.getLogger(__name__)

# Placeholder exception types -- substitute your provider SDK's equivalents.
class RateLimitError(Exception): ...
class ServerError(Exception): ...

async def retry_with_backoff(
    fn,
    max_retries=3,
    base_delay=1.0,
    max_delay=30.0,
    retryable_errors=(RateLimitError, TimeoutError, ServerError),
):
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except retryable_errors as e:
            if attempt == max_retries:
                raise
            # Exponential backoff with full jitter: sleep a random amount
            # between 0 and the capped exponential delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            actual_delay = random.uniform(0, delay)
            logger.warning(
                f"Attempt {attempt + 1} failed: {e}. "
                f"Retrying in {actual_delay:.1f}s"
            )
            await asyncio.sleep(actual_delay)
```

Three rules for AI retries:

1. **Only retry transient errors.** A 400 (bad request) will fail every time; a 429 (rate limit) or 503 (overloaded) is worth retrying.
2. **Use full jitter.** Without jitter, all clients retry at the same moment, creating a thundering herd that re-crashes the service.
3. **Cap your retry budget.** Three retries with exponential backoff is usually sufficient; more than five is almost never the right answer.
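To see the pattern end to end, here is a minimal, self-contained sketch: a compact version of the retry helper plus a fake flaky call that fails twice with a rate-limit error before succeeding (all names here are illustrative, not a real SDK):

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for a provider's 429 error."""

async def retry_with_backoff(fn, max_retries=3, base_delay=0.01, max_delay=1.0,
                             retryable=(RateLimitError, TimeoutError)):
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except retryable:
            if attempt == max_retries:
                raise
            # Full jitter: random delay in [0, capped exponential backoff].
            await asyncio.sleep(random.uniform(0, min(base_delay * 2 ** attempt, max_delay)))

attempts = 0

async def flaky_call():
    # Fails on the first two attempts, then succeeds.
    global attempts
    attempts += 1
    if attempts < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"

result = asyncio.run(retry_with_backoff(flaky_call))
print(result, attempts)  # ok 3
```

In production you would pass a closure over your real client call, e.g. `lambda: client.messages.create(...)`, rather than a module-level function.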
Cost awareness: If you're retrying Claude Opus calls, each retry costs the same as the original. Three retries on a $0.15 call means you might spend $0.60 total. Factor retry budgets into your cost models.
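That worst-case retry spend is worth wiring into your cost model explicitly. A tiny sketch of the budget math (illustrative names, not a real billing API):

```python
def worst_case_cost(cost_per_call: float, max_retries: int) -> float:
    # The original attempt plus up to max_retries retries, each billed in full --
    # providers charge for failed-then-retried calls just like successful ones.
    return cost_per_call * (1 + max_retries)

print(round(worst_case_cost(0.15, 3), 2))  # 0.6
```

If a request path can't tolerate a 4x cost multiplier, lower `max_retries` for that path or fall back to a cheaper model instead of retrying the expensive one.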