Error Handling
Agents fail. Good agents fail gracefully. Here are the five most common failure modes, how to detect them, and production code patterns to handle each one.
The Five Failure Modes
Every agent you build will encounter these failures. Learning them now means you design for resilience from day one.
Failure Mode 1: Tool Execution Errors
An API returns 500, a database query times out, a service is down. The tool call fails, but the agent should not crash.
Example: Weather API returns 503 Service Unavailable after your agent promised to check the forecast.
Failure Mode 2: Invalid or Corrupt Data
The tool returns data, but it is wrong: negative prices, dates in the future, missing required fields. The agent must detect and handle corrupt data.
Example: Database returns a customer balance of -$50,000. That is a data bug, not real debt.
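Detection is a matter of sanity-checking tool output before the agent acts on it. A minimal sketch, assuming an illustrative record shape with a `balance` field (the field names here are not from any specific API):

```python
def validate_balance(record):
    """Sanity-check a customer record before the agent acts on it.
    The field names are illustrative assumptions, not a real schema."""
    errors = []
    if "balance" not in record:
        errors.append("missing required field: balance")
    elif record["balance"] < 0:
        errors.append(f"negative balance: {record['balance']}")
    return errors

# A corrupt record is flagged instead of silently passed to the agent
assert validate_balance({"balance": -50000}) == ["negative balance: -50000"]
assert validate_balance({"balance": 120.50}) == []
```

Returning a list of problems (rather than raising) lets the agent report exactly what looked wrong and decide whether to retry, fall back, or ask a human.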
Failure Mode 3: Ambiguous Requests
The user's request is too vague to act on safely: "Do the thing from last time" when there is no context. Acting on low confidence causes more damage than asking for clarity.
Example: "Fix the issue" — which issue? In which system? What counts as fixed?
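The safe response to ambiguity is to ask, not act. One way to sketch this, using a crude keyword heuristic as a stand-in for a real confidence signal (the function name and return shape are assumptions for illustration):

```python
def handle_request(request, context):
    """Refuse to act on vague requests; return a clarifying question instead.
    The keyword check is a stand-in for a real confidence estimate."""
    vague_phrases = {"the thing", "the issue"}
    if any(phrase in request.lower() for phrase in vague_phrases) and not context:
        return {
            "action": "clarify",
            "question": "Which issue do you mean, and in which system?",
        }
    return {"action": "proceed", "request": request}

assert handle_request("Fix the issue", context=None)["action"] == "clarify"
assert handle_request("Restart the billing service", context=None)["action"] == "proceed"
```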
Failure Mode 4: Guardrail Violations
The user requests something the agent is explicitly forbidden from doing. The agent must refuse while remaining helpful.
Example: "Send this confidential report to all 500 employees" when guardrails restrict confidential docs.
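A guardrail check should block the action but still give the user a path forward. A minimal sketch, assuming a hypothetical action name and policy set (neither comes from a real system):

```python
# Assumed policy config; the action name is hypothetical
RESTRICTED_ACTIONS = {"send_confidential_doc"}

def check_guardrails(action):
    """Block forbidden actions with a helpful refusal, not a bare error."""
    if action in RESTRICTED_ACTIONS:
        return {
            "allowed": False,
            "reason": "Confidential documents cannot be mass-distributed.",
            "alternative": "I can draft a non-confidential summary for approval instead.",
        }
    return {"allowed": True}

assert check_guardrails("send_confidential_doc")["allowed"] is False
assert check_guardrails("send_weekly_digest")["allowed"] is True
```

The `alternative` field is what keeps the refusal helpful: the agent declines the forbidden action while offering something it is allowed to do.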
Failure Mode 5: Infinite Loops
The agent keeps retrying the same failed approach without making progress: 48 attempts at the same fix with zero improvement.
Example: Agent tries to fix a failing test by changing the same line of code, failing each time.
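Loops like this are detectable by counting repeated identical actions. A sketch of one approach, with an illustrative threshold of three identical attempts (the class name and limit are assumptions, not a standard API):

```python
from collections import Counter

class LoopDetector:
    """Abort when the agent repeats the same action too many times.
    The limit of 3 identical attempts is an illustrative threshold."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def allow(self, action, params):
        """Return False once the same (action, params) pair exceeds the limit."""
        key = (action, repr(sorted(params.items())))
        self.counts[key] += 1
        return self.counts[key] <= self.max_repeats

detector = LoopDetector()
for _ in range(3):
    assert detector.allow("edit_file", {"line": 42})   # first 3 attempts pass
assert not detector.allow("edit_file", {"line": 42})   # 4th identical attempt blocked
```

When `allow` returns False, the agent should stop, summarize what it tried, and either change strategy or escalate to a human.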
Pattern 1: Retry with Exponential Backoff
For transient failures (network errors, rate limits, temporary outages), retry with increasing wait times:
```python
import time

def retry_with_backoff(func, max_retries=3):
    """Retry a function with exponential backoff: waits of 1s, then 2s,
    before the third and final attempt."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Last attempt: let it fail
            wait = 2 ** attempt  # 1s, 2s
            print(f"Retry {attempt + 1}/{max_retries} in {wait}s: {e}")
            time.sleep(wait)

# Usage in your agent
def execute_tool_safe(name, params):
    try:
        return retry_with_backoff(
            lambda: execute_tool(name, params)
        )
    except Exception as e:
        # All retries exhausted: return error to Claude
        return {"error": str(e), "tool": name}
```
If a service is overloaded, hammering it with rapid retries makes the problem worse. Exponential backoff (1s, 2s, 4s) gives the service progressively more time to recover while still attempting the call.
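One refinement worth knowing: if many clients fail at the same moment, plain exponential backoff has them all retrying in lockstep. Adding random jitter spreads the retries out. A minimal sketch (the function name and cap are assumptions, not part of the code above):

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=30.0):
    """Full jitter: a random wait in [0, min(cap, base * 2**attempt)].
    Randomizing the delay keeps many clients from retrying in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Attempt 3 waits somewhere between 0 and 8 seconds
wait = backoff_with_jitter(3)
assert 0 <= wait <= 8
```

The `cap` keeps later retries from growing into multi-minute waits; swapping this in for the fixed `wait = 2 ** attempt` line is a one-line change.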
Pattern 2: Graceful Degradation
When the primary tool fails, fall back to an alternative instead of giving up:
```python
def get_weather(location):
    # Try primary API
    try:
        return retry_with_backoff(
            lambda: weather_api.get(location)
        )
    except Exception:
        pass

    # Fallback: try web search
    try:
        return web_search.query(f"weather in {location} today")
    except Exception:
        pass

    # All fallbacks exhausted
    return {
        "error": "Weather data unavailable",
        "tried": ["weather_api", "web_search"],
        "suggestion": "Try again in a few minutes",
    }
```