Error Handling
Agents fail. Good agents fail gracefully. Here are the five most common failure modes, how to detect them, and production code patterns to handle each one.
The Five Failure Modes
Every agent you build will encounter these failures. Learning them now means you design for resilience from day one.
Failure Mode 1: Tool Execution Errors
An API returns 500, a database query times out, a service is down. The tool call fails, but the agent should not crash.
Example: Weather API returns 503 Service Unavailable after your agent promised to check the forecast.
Failure Mode 2: Invalid or Corrupt Data
The tool returns data, but it is wrong: negative prices, dates in the future, missing required fields. The agent must detect and handle corrupt data.
Example: Database returns a customer balance of -$50,000. That is a data bug, not real debt.
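Detection is a matter of sanity-checking tool output before the agent acts on it. A minimal sketch, assuming an illustrative record shape with a `balance` field (the field names here are not from any specific API):

```python
def validate_balance(record):
    """Sanity-check a customer record before the agent acts on it.
    The field names are illustrative assumptions, not a real schema."""
    errors = []
    if "balance" not in record:
        errors.append("missing required field: balance")
    elif record["balance"] < 0:
        errors.append(f"negative balance: {record['balance']}")
    return errors

# A corrupt record is flagged instead of silently passed to the agent
assert validate_balance({"balance": -50000}) == ["negative balance: -50000"]
assert validate_balance({"balance": 120.50}) == []
```

Returning a list of problems (rather than raising) lets the agent report exactly what looked wrong and decide whether to retry, fall back, or ask a human.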
Failure Mode 3: Ambiguous Requests
The user's request is too vague to act on safely: "Do the thing from last time" when there is no context. Acting on low confidence causes more damage than asking for clarity.
Example: "Fix the issue" — which issue? In which system? What counts as fixed?
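The safe response to ambiguity is to ask, not act. One way to sketch this, using a crude keyword heuristic as a stand-in for a real confidence signal (the function name and return shape are assumptions for illustration):

```python
def handle_request(request, context):
    """Refuse to act on vague requests; return a clarifying question instead.
    The keyword check is a stand-in for a real confidence estimate."""
    vague_phrases = {"the thing", "the issue"}
    if any(phrase in request.lower() for phrase in vague_phrases) and not context:
        return {
            "action": "clarify",
            "question": "Which issue do you mean, and in which system?",
        }
    return {"action": "proceed", "request": request}

assert handle_request("Fix the issue", context=None)["action"] == "clarify"
assert handle_request("Restart the billing service", context=None)["action"] == "proceed"
```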
Failure Mode 4: Guardrail Violations
The user requests something the agent is explicitly forbidden from doing. The agent must refuse while remaining helpful.
Example: "Send this confidential report to all 500 employees" when guardrails restrict confidential docs.
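A guardrail check should block the action but still give the user a path forward. A minimal sketch, assuming a hypothetical action name and policy set (neither comes from a real system):

```python
# Assumed policy config; the action name is hypothetical
RESTRICTED_ACTIONS = {"send_confidential_doc"}

def check_guardrails(action):
    """Block forbidden actions with a helpful refusal, not a bare error."""
    if action in RESTRICTED_ACTIONS:
        return {
            "allowed": False,
            "reason": "Confidential documents cannot be mass-distributed.",
            "alternative": "I can draft a non-confidential summary for approval instead.",
        }
    return {"allowed": True}

assert check_guardrails("send_confidential_doc")["allowed"] is False
assert check_guardrails("send_weekly_digest")["allowed"] is True
```

The `alternative` field is what keeps the refusal helpful: the agent declines the forbidden action while offering something it is allowed to do.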
Failure Mode 5: Infinite Loops
The agent keeps retrying the same failed approach without making progress: 48 attempts at the same fix with zero improvement.
Example: Agent tries to fix a failing test by changing the same line of code, failing each time.
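Loops like this are detectable by counting repeated identical actions. A sketch of one approach, with an illustrative threshold of three identical attempts (the class name and limit are assumptions, not a standard API):

```python
from collections import Counter

class LoopDetector:
    """Abort when the agent repeats the same action too many times.
    The limit of 3 identical attempts is an illustrative threshold."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def allow(self, action, params):
        """Return False once the same (action, params) pair exceeds the limit."""
        key = (action, repr(sorted(params.items())))
        self.counts[key] += 1
        return self.counts[key] <= self.max_repeats

detector = LoopDetector()
for _ in range(3):
    assert detector.allow("edit_file", {"line": 42})   # first 3 attempts pass
assert not detector.allow("edit_file", {"line": 42})   # 4th identical attempt blocked
```

When `allow` returns False, the agent should stop, summarize what it tried, and either change strategy or escalate to a human.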
Pattern 1: Retry with Exponential Backoff
For transient failures (network errors, rate limits, temporary outages), retry with increasing wait times:
```python
import time

def retry_with_backoff(func, max_retries=3):
    """Retry a function with exponential backoff: waits of 1s, then 2s,
    before the third and final attempt."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Last attempt: let it fail
            wait = 2 ** attempt  # 1s, 2s
            print(f"Retry {attempt + 1}/{max_retries} in {wait}s: {e}")
            time.sleep(wait)

# Usage in your agent
def execute_tool_safe(name, params):
    try:
        return retry_with_backoff(
            lambda: execute_tool(name, params)
        )
    except Exception as e:
        # All retries exhausted: return error to Claude
        return {"error": str(e), "tool": name}
```
If a service is overloaded, hammering it with rapid retries makes the problem worse. Exponential backoff (1s, 2s, 4s) gives the service progressively more time to recover while still attempting the call.
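One refinement worth knowing: if many clients fail at the same moment, plain exponential backoff has them all retrying in lockstep. Adding random jitter spreads the retries out. A minimal sketch (the function name and cap are assumptions, not part of the code above):

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=30.0):
    """Full jitter: a random wait in [0, min(cap, base * 2**attempt)].
    Randomizing the delay keeps many clients from retrying in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Attempt 3 waits somewhere between 0 and 8 seconds
wait = backoff_with_jitter(3)
assert 0 <= wait <= 8
```

The `cap` keeps later retries from growing into multi-minute waits; swapping this in for the fixed `wait = 2 ** attempt` line is a one-line change.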
Pattern 2: Graceful Degradation
When the primary tool fails, fall back to an alternative instead of giving up:
```python
def get_weather(location):
    # Try primary API
    try:
        return retry_with_backoff(
            lambda: weather_api.get(location)
        )
    except Exception:
        pass

    # Fallback: try web search
    try:
        return web_search.query(f"weather in {location} today")
    except Exception:
        pass

    # All fallbacks exhausted
    return {
        "error": "Weather data unavailable",
        "tried": ["weather_api", "web_search"],
        "suggestion": "Try again in a few minutes",
    }
```