How often should I checkpoint agent state?

After every meaningful action: tool call, phase transition, or critical decision. Not after every token. A checkpoint write to SQLite takes ~1-5ms. Aim for 1-2 checkpoints per agent step, not 100.

What token budget should I set?

Start with 50-100K tokens per interaction (roughly $0.15-0.30 at about $3 per million tokens; output tokens and larger models cost more). Monitor actual usage for a week, then adjust. For long-running research agents, budget higher (200K+). For quick classification, 10-20K is plenty.
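To turn a token budget into a dollar figure, multiply input and output tokens by their per-million-token prices. The sketch below shows that arithmetic. The prices in it are placeholder assumptions for illustration, not quoted rates, so substitute the current pricing for your model.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """Dollar cost of one interaction, given prices per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Assumed example prices (USD per million tokens); check your model's pricing page.
EXAMPLE_INPUT_PRICE = 3.00
EXAMPLE_OUTPUT_PRICE = 15.00

# A 100K-token interaction split 80K input / 20K output
cost = estimate_cost_usd(80_000, 20_000, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
print(f"Estimated cost: ${cost:.2f}")
```

Because output tokens are typically priced higher than input tokens, the same total token count can cost noticeably more when the agent writes long responses.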

Can I use these patterns with the Agent SDK?

Yes. Checkpointing, validation, tool fallbacks, and observability all work with the Agent SDK. The SDK's event loop makes it even easier to hook into each step and checkpoint.

What if my agent gets stuck in a retry loop?

Add a max-retries limit (3-5) and an overall timeout. If validation keeps failing, halt and escalate to human review. Log the failure so you can investigate. Better to fail loudly than loop forever.
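A minimal sketch of the retry cap plus overall timeout, assuming each attempt is a function that returns a validated result or None. `run_with_limits` and `EscalateToHuman` are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional

class EscalateToHuman(Exception):
    """Raised when the retry cap or the deadline is hit."""

def run_with_limits(attempt_fn: Callable[[int], Optional[str]],
                    max_retries: int = 3,
                    timeout_seconds: float = 60.0) -> str:
    """Call attempt_fn until it returns a result, or fail loudly.

    attempt_fn receives the 1-based attempt number and returns a validated
    result, or None when validation failed.
    """
    deadline = time.monotonic() + timeout_seconds
    for attempt in range(1, max_retries + 1):
        # The deadline is checked between attempts; it does not interrupt
        # an attempt that is already running.
        if time.monotonic() > deadline:
            raise EscalateToHuman(f"timed out before attempt {attempt}")
        result = attempt_fn(attempt)
        if result is not None:
            return result
    raise EscalateToHuman(f"validation failed {max_retries} times")
```

Catching `EscalateToHuman` at the top level gives you one place to log the failure and route the task to review.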

How do I monitor agents in production?

Parse your audit logs daily. Graph: (1) checkpoint frequency (should be steady), (2) tokens per interaction (should be stable), (3) validation pass rate (should be >95%), (4) tool failure rate (should be <5%). Alert if any metric degrades.
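One way to compute two of these metrics is sketched below. It assumes each audit log line is a single JSON object with the `event_type`, `input_tokens`, and `output_tokens` fields used in Pattern 5. The function names and default thresholds are illustrative assumptions.

```python
import json
from collections import Counter

def summarize_audit_log(log_lines):
    """Aggregate audit events (one JSON object per line) into health metrics."""
    counts = Counter()
    total_tokens = 0
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(event, dict):
            continue
        kind = event.get("event_type")
        counts[kind] += 1
        if kind == "api_call":
            total_tokens += event.get("input_tokens", 0) + event.get("output_tokens", 0)
    steps = counts["step_start"]
    return {
        "steps": steps,
        "avg_tokens_per_step": total_tokens / steps if steps else 0.0,
        "step_failure_rate": counts["step_failed"] / steps if steps else 0.0,
    }

def find_alerts(summary, max_failure_rate=0.05, max_avg_tokens=100_000):
    """Return one message for each threshold the summary exceeds."""
    alerts = []
    if summary["step_failure_rate"] > max_failure_rate:
        alerts.append(f"step failure rate {summary['step_failure_rate']:.1%} is above {max_failure_rate:.1%}")
    if summary["avg_tokens_per_step"] > max_avg_tokens:
        alerts.append(f"average tokens per step {summary['avg_tokens_per_step']:.0f} is above {max_avg_tokens}")
    return alerts
```

Running this once a day over the previous day's log and sending any non-empty alert list to your paging channel covers the "alert if any metric degrades" step.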

Deploying Claude Agents to Production

Build reliable agents that survive API failures, token limits, and state loss. Patterns from 10K+ interactions.

The Claude Agent SDK makes it trivial to build agents that plan, reason, and execute. Ship a proof-of-concept in an afternoon. But production agents are different. Your agent needs to survive rate limits, API outages, and token explosions. It needs to remember what failed so it doesn't retry the same broken path. It needs to cost $0.50 per interaction, not $50.

We run a production agent system at Like One that processes 10K+ interactions monthly: grant research, proposal writing, donor communication, and compliance checks. This guide shows the patterns we use to keep it running reliably.

The Production Agent Checklist (Before You Deploy)

Most agent failures fall into three buckets:

State loss — agent forgets what it's doing mid-task
Cost explosion — one runaway agent blows the monthly budget
Silent failures — agent produces wrong answer without error

Before shipping, ask:

Can my agent retry a failed step without losing context?
What's the worst-case token cost per interaction?
How does my agent detect and recover from semantic failures (wrong answer, not just API errors)?
Can I pause or cancel a running agent?
Do I log every tool call, decision, and failure reason?

If you can't answer three of these, don't deploy yet.
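For the pause-or-cancel question, one common approach is a cooperative cancel flag that the agent checks between steps. This is a minimal sketch assuming steps are plain callables; `CancellableRun` is a hypothetical name, not part of any SDK.

```python
import threading

class CancellableRun:
    """Runs steps in order and stops before the next step once cancelled."""

    def __init__(self):
        self._cancel = threading.Event()

    def cancel(self):
        # Safe to call from another thread, e.g. an admin endpoint or signal handler
        self._cancel.set()

    def run(self, steps):
        completed = []
        for step in steps:
            if self._cancel.is_set():
                break  # the step already in progress finished; later steps are skipped
            completed.append(step())
        return completed
```

The flag only takes effect at step boundaries, so a long API call still runs to completion before the agent stops.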

Pattern 1: State Checkpointing (Survive Restarts)

Store agent state to disk/DB after every meaningful action. If the agent crashes or the process restarts, resume from the last checkpoint instead of starting over.

import json
import sqlite3
from datetime import datetime
from anthropic import Anthropic
from enum import Enum

client = Anthropic()

class AgentPhase(Enum):
    INITIAL = "initial"
    RESEARCHING = "researching"
    DRAFTING = "drafting"
    REVIEWING = "reviewing"
    COMPLETE = "complete"
    FAILED = "failed"

class PersistentAgent:
    def __init__(self, agent_id: str, db_path: str = "agent_state.db"):
        self.agent_id = agent_id
        self.db_path = db_path
        self.messages = []
        self.phase = AgentPhase.INITIAL
        self.checkpoints = []
        self._init_db()
        self._load_or_init_state()
    
    def _init_db(self):
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        c.execute('''
            CREATE TABLE IF NOT EXISTS agent_state (
                agent_id TEXT PRIMARY KEY,
                phase TEXT,
                messages TEXT,
                checkpoints TEXT,
                last_checkpoint TEXT,
                created_at TEXT,
                updated_at TEXT
            )
        ''')
        conn.commit()
        conn.close()
    
    def _load_or_init_state(self):
        """Resume from last checkpoint if exists, else init new state."""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        c.execute('SELECT phase, messages, checkpoints FROM agent_state WHERE agent_id = ?', (self.agent_id,))
        result = c.fetchone()
        conn.close()
        
        if result:
            phase_str, messages_json, checkpoints_json = result
            self.phase = AgentPhase(phase_str)
            self.messages = json.loads(messages_json)
            self.checkpoints = json.loads(checkpoints_json)
            print(f"✅ Resumed agent {self.agent_id} from phase {self.phase.value}")
        else:
            # Initialize new agent
            self._save_checkpoint("initialized")
    
    def _save_checkpoint(self, reason: str):
        """Save current state to DB."""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        
        checkpoint = {
            "reason": reason,
            "phase": self.phase.value,
            "message_count": len(self.messages),
            "timestamp": datetime.now().isoformat()
        }
        self.checkpoints.append(checkpoint)
        
        # Keep last 10 checkpoints
        if len(self.checkpoints) > 10:
            self.checkpoints = self.checkpoints[-10:]
        
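        # Upsert (SQLite 3.24+) so created_at is written once and survives later checkpoints.
        # INSERT OR REPLACE would delete the row and reset created_at to NULL every time.
        now = datetime.now().isoformat()
        c.execute('''
            INSERT INTO agent_state
            (agent_id, phase, messages, checkpoints, last_checkpoint, created_at, updated_at)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(agent_id) DO UPDATE SET
                phase = excluded.phase,
                messages = excluded.messages,
                checkpoints = excluded.checkpoints,
                last_checkpoint = excluded.last_checkpoint,
                updated_at = excluded.updated_at
        ''', (
            self.agent_id,
            self.phase.value,
            json.dumps(self.messages),
            json.dumps(self.checkpoints),
            now, now, now
        ))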
        conn.commit()
        conn.close()
    
    def transition(self, new_phase: AgentPhase, reason: str = ""):
        """Move to next phase and checkpoint."""
        self.phase = new_phase
        self._save_checkpoint(f"phase_transition: {reason}")
    
    def step(self, system_prompt: str, user_input: str) -> str:
        """One reasoning step. Checkpoint after each step."""
        try:
            self.messages.append({"role": "user", "content": user_input})
            
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                system=system_prompt,
                messages=self.messages
            )
            
            assistant_message = response.content[0].text
            self.messages.append({"role": "assistant", "content": assistant_message})
            
            # Checkpoint success
            self._save_checkpoint(f"step_success: {user_input[:50]}")
            return assistant_message
            
        except Exception as e:
            # Checkpoint failure (but don't lose conversation history)
            self._save_checkpoint(f"step_failed: {str(e)[:100]}")
            raise

# Usage: Multi-step agent with recovery
agent = PersistentAgent(agent_id="research_agent_001")

# On restart, agent resumes from last phase
if agent.phase == AgentPhase.INITIAL:
    agent.transition(AgentPhase.RESEARCHING, "starting research")
    result = agent.step(
        system_prompt="You are a research assistant for nonprofits. Find grants.",
        user_input="What foundation grants exist for HIV cure research?"
    )
    print(f"Research result: {result}")

# If process crashes here, next startup will be in RESEARCHING phase
# with all prior messages intact.

Key benefit: If your Python process crashes mid-task, the next startup resumes from the last checkpoint. Completed steps are not re-run, so you don't pay for their tokens twice; only the step that was in flight is repeated.

Cost impact: Negligible. A SQLite write takes roughly 1-5ms. Save checkpoints after every meaningful action, not after every token.
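Write latency depends on disk, payload size, and SQLite settings, so it is worth measuring on your own hardware. The sketch below times checkpoint-style writes against a temporary database. The table layout and payload size are assumptions chosen to resemble the example above, not a benchmark of it.

```python
import json
import os
import sqlite3
import tempfile
import time

def time_checkpoint_writes(n_writes: int = 50, n_messages: int = 20) -> float:
    """Return the median time in milliseconds for one write-and-commit."""
    # Synthetic conversation history of roughly n_messages * 500 characters
    messages = [{"role": "user", "content": "x" * 500} for _ in range(n_messages)]
    payload = json.dumps(messages)
    timings = []
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "bench.db"))
        conn.execute("CREATE TABLE state (agent_id TEXT PRIMARY KEY, messages TEXT)")
        for _ in range(n_writes):
            start = time.perf_counter()
            conn.execute("INSERT OR REPLACE INTO state VALUES (?, ?)", ("bench", payload))
            conn.commit()
            timings.append((time.perf_counter() - start) * 1000)
        conn.close()
    timings.sort()
    return timings[len(timings) // 2]

print(f"Median checkpoint write: {time_checkpoint_writes():.2f} ms")
```

If the median is much higher than a few milliseconds, check whether the database sits on network storage before reducing checkpoint frequency.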

Pattern 2: Token Budget & Circuit Breaker

Track tokens consumed per agent interaction. If you're about to exceed the budget, halt the agent instead of letting it spiral.

import json
from anthropic import Anthropic

client = Anthropic()

class BudgetedAgent:
    def __init__(self, max_tokens_per_interaction: int = 50000, model: str = "claude-opus-4-6"):
        self.max_tokens = max_tokens_per_interaction
        self.tokens_used = 0
        self.model = model
        self.messages = []
    
    def estimate_tokens(self, text: str) -> int:
        """Rough estimate: ~1 token per 3 characters."""
        return len(text) // 3
    
    def step(self, system_prompt: str, user_input: str, max_step_tokens: int = 2048) -> tuple:
        """Execute one step. Returns (response, tokens_used, budget_ok)."""
        
        # Estimate cost BEFORE calling API
        input_estimate = self.estimate_tokens(system_prompt + str(self.messages) + user_input)
        output_estimate = max_step_tokens
        total_estimate = input_estimate + output_estimate
        
        # Circuit breaker: halt if over budget
        if self.tokens_used + total_estimate > self.max_tokens:
            return (
                f"❌ Token budget exceeded. Used {self.tokens_used}/{self.max_tokens}. Stopping.",
                0,
                False
            )
        
        # Safe to proceed
        self.messages.append({"role": "user", "content": user_input})
        
        response = client.messages.create(
            model=self.model,
            max_tokens=max_step_tokens,
            system=system_prompt,
            messages=self.messages
        )
        
        assistant_message = response.content[0].text
        self.messages.append({"role": "assistant", "content": assistant_message})
        
        # Track actual usage
        actual_input = response.usage.input_tokens
        actual_output = response.usage.output_tokens
        actual_total = actual_input + actual_output
        self.tokens_used += actual_total
        
        return (
            assistant_message,
            actual_total,
            self.tokens_used < self.max_tokens
        )

# Usage
agent = BudgetedAgent(max_tokens_per_interaction=100000)  # hard cap per task; dollar cost depends on model pricing

response, tokens, ok = agent.step(
    system_prompt="You are a grant researcher.",
    user_input="Find 10 AI grants with $1M+ budgets."
)

if not ok:
    print("⚠️  Budget exceeded. Halting agent.")
else:
    print(f"✅ Step used {tokens} tokens. Budget remaining: {agent.max_tokens - agent.tokens_used}")

Prevents: Runaway agents that loop endlessly. One bad prompt = one expensive mistake, not a $1000 bill.

Pattern 3: Failure Detection & Semantic Validation

API success != correct answer. Your agent might produce confident nonsense. Detect semantic failures before they propagate.

import json
from anthropic import Anthropic

client = Anthropic()

class ValidatingAgent:
    def __init__(self):
        self.messages = []
        self.validation_failures = []
    
    def validate_grant_response(self, response: str) -> dict:
        """Use Claude to validate if grant response is credible."""
        
        validation_prompt = f"""Is this grant research output credible and complete?
        
Output:
{response}

Respond in JSON:
{{
  "is_valid": true/false,
  "issues": ["issue1", "issue2"],
  "confidence": 0.0-1.0,
  "suggestion": "what to do next"
}}

Be strict. Empty lists = red flag. Missing eligibility info = red flag."""
        
        validation = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=500,
            messages=[{"role": "user", "content": validation_prompt}]
        ).content[0].text
        
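        try:
            return json.loads(validation)
        except json.JSONDecodeError:
            # Supply every key the retry loop reads so a malformed reply can't raise KeyError
            return {"is_valid": False, "issues": ["Validation response malformed"],
                    "confidence": 0.0, "suggestion": ""}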
    
    def research_grants(self, query: str) -> dict:
        """Research grants with validation loop."""
        
        self.messages.append({"role": "user", "content": query})
        
        max_retries = 3
        for attempt in range(max_retries):
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                system="You are a grant researcher. Return structured grant data.",
                messages=self.messages
            ).content[0].text
            
            self.messages.append({"role": "assistant", "content": response})
            
            # Validate
            validation = self.validate_grant_response(response)
            
            if validation.get("is_valid") and validation.get("confidence", 0) > 0.8:
                return {
                    "status": "success",
                    "response": response,
                    "validation": validation,
                    "attempts": attempt + 1
                }
            
            # Semantic failure detected; .get() guards against keys the validator omitted
            issues = validation.get("issues", [])
            suggestion = validation.get("suggestion", "")
            self.validation_failures.append({
                "attempt": attempt + 1,  # 1-based, matching "attempts" in the success result
                "issues": issues,
                "suggestion": suggestion
            })
            
            # Retry with feedback
            if attempt < max_retries - 1:
                retry_prompt = f"""Your previous response had issues:\n
{json.dumps(issues, indent=2)}

Try again. Be thorough. {suggestion}"""
                self.messages.append({"role": "user", "content": retry_prompt})
        
        # All retries failed
        return {
            "status": "failed_validation",
            "failures": self.validation_failures,
            "last_response": response
        }

# Usage
agent = ValidatingAgent()
result = agent.research_grants(
    "Find 5 grants for AI nonprofits under $50K with short turnaround (30 days)."
)

if result["status"] == "success":
    print(f"✅ Grant research complete ({result['attempts']} attempts)")
else:
    print(f"❌ Research failed validation: {result['failures']}")

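Catches: Responses that return empty lists, omit eligibility requirements, or contain implausible grant details. The loop retries automatically with the validator's feedback. The validator is itself a model call with no access to ground truth, so it filters out weak answers but cannot confirm that a named grant actually exists.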
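Validators that ask a model for JSON often get the JSON wrapped in a code fence or a sentence of preamble, which makes a bare `json.loads` fail. A small tolerant parser helps. This is a sketch under that assumption, and `parse_json_reply` is a hypothetical helper name.

```python
import json

def parse_json_reply(text: str):
    """Return the JSON object found in a model reply, or None if there is none."""
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, which covers fenced or prefixed replies
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Treat a None result the same as a failed validation, so an unparseable reply triggers a retry instead of being mistaken for approval.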

Pattern 4: Tool Failure Recovery

When your agent calls a tool (database query, API, file write), that tool can fail. Have a fallback.

from anthropic import Anthropic
import json
import time

client = Anthropic()

class ResilientToolAgent:
    def __init__(self):
        self.messages = []
        self.tools = [
            {
                "name": "search_grants",
                "description": "Search foundation grants database",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "keyword": {"type": "string"},
                        "max_budget": {"type": "number"}
                    }
                }
            }
        ]
    
    def execute_tool(self, tool_name: str, tool_input: dict) -> dict:
        """Execute tool, falling back to cached data if the primary source fails."""
        
        if tool_name == "search_grants":
            keyword = tool_input.get("keyword")
            max_budget = tool_input.get("max_budget", 1000000)
            
            # Try primary database
            try:
                results = self._query_grants_db(keyword, max_budget)
                if results:
                    return {"status": "success", "results": results}
            except Exception as e:
                print(f"⚠️  Primary DB failed: {e}. Trying fallback...")
            
            # Fallback: cache or alternative source
            try:
                results = self._query_grants_cache(keyword)
                return {
                    "status": "partial",
                    "results": results,
                    "note": "Using cached data. Results may be stale."
                }
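            except Exception as e:
                print(f"⚠️  Fallback cache failed: {e}")
            
            # Last resort: return empty but don't crash
            return {
                "status": "failed",
                "results": [],
                "error": "Both primary and fallback failed. No grants found."
            }
        
        # Unknown tool: tell the model instead of silently returning None
        return {"status": "failed", "results": [], "error": f"Unknown tool: {tool_name}"}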
    
    def _query_grants_db(self, keyword, max_budget):
        # Simulated DB query
        raise Exception("Database connection timeout")
    
    def _query_grants_cache(self, keyword):
        # Fallback to cached data
        return [{"name": "AWS Imagine Grant", "budget": 50000}]
    
    def step_with_tools(self, user_input: str) -> str:
        """Execute agent step, handling tool failures."""
        
        self.messages.append({"role": "user", "content": user_input})
        
        max_iterations = 10  # cap the loop so a confused agent can't call tools forever
        for _ in range(max_iterations):
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                system="You are a grant researcher with access to search_grants tool.",
                tools=self.tools,
                messages=self.messages
            )
            
            # Record the assistant turn exactly once, whether or not it calls tools
            self.messages.append({"role": "assistant", "content": response.content})
            
            # Check if agent is done (no tool calls)
            tool_calls = [block for block in response.content if block.type == "tool_use"]
            if not tool_calls:
                return next(
                    (block.text for block in response.content if block.type == "text"),
                    "No response"
                )
            
            # Run every requested tool and send all results back in a single user message
            tool_results = []
            for block in tool_calls:
                tool_result = self.execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(tool_result)
                })
            self.messages.append({"role": "user", "content": tool_results})
        
        raise RuntimeError(f"Agent did not finish within {max_iterations} tool iterations")

# Usage
agent = ResilientToolAgent()
result = agent.step_with_tools("Find AI grants under $50K")
print(result)

Prevents: One failed API call from breaking your entire agent. Fallbacks let the agent complete the task, even if in degraded mode.

Pattern 5: Observability — Logging Every Decision

If an agent misbehaves in production, you need to know exactly what it did. Log every API call, tool invocation, and decision point.

import json
import logging
from datetime import datetime
from anthropic import Anthropic

client = Anthropic()

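# Structured logging for agents: one JSON event per line
logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)  # the default effective level is WARNING, which drops info() events
handler = logging.FileHandler("agent_audit.log")
# log_event() already emits a complete JSON object with its own timestamp,
# so write the message as-is instead of nesting it inside another JSON string
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)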

class ObservableAgent:
    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.messages = []
        self.session_id = datetime.now().isoformat()
    
    def log_event(self, event_type: str, details: dict):
        """Log agent action."""
        event = {
            "session_id": self.session_id,
            "agent_id": self.agent_id,
            "event_type": event_type,
            "timestamp": datetime.now().isoformat(),
            **details
        }
        logger.info(json.dumps(event))
    
    def step(self, system_prompt: str, user_input: str) -> str:
        """Step with full audit trail."""
        
        self.log_event("step_start", {"user_input": user_input[:100]})
        
        self.messages.append({"role": "user", "content": user_input})
        
        try:
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                system=system_prompt,
                messages=self.messages
            )
            
            assistant_message = response.content[0].text
            self.messages.append({"role": "assistant", "content": assistant_message})
            
            # Log API usage
            self.log_event("api_call", {
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "model": "claude-opus-4-6",
                "response_length": len(assistant_message)
            })
            
            return assistant_message
            
        except Exception as e:
            self.log_event("step_failed", {"error": str(e), "error_type": type(e).__name__})
            raise

# Usage
agent = ObservableAgent(agent_id="research_agent_001")
agent.step(
    system_prompt="You research grants.",
    user_input="What AI grants exist?"
)

# Log file now contains:
# {"session_id": "2026-06-27T...", "agent_id": "research_agent_001", "event_type": "step_start", ...}
# {"session_id": "2026-06-27T...", "agent_id": "research_agent_001", "event_type": "api_call", ...}

For debugging: Parse agent_audit.log to understand exactly what happened. Which tool failed? When? With what input?
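A minimal sketch of that kind of lookup, assuming each log line is one JSON event carrying the `session_id`, `timestamp`, `event_type`, and optional `error` fields from `log_event` above. `session_timeline` is a hypothetical helper name.

```python
import json

def session_timeline(log_lines, session_id):
    """Return (timestamp, event_type, error) tuples for one session, oldest first."""
    events = []
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON events
        if isinstance(event, dict) and event.get("session_id") == session_id:
            events.append(event)
    # ISO 8601 timestamps in the same format sort correctly as plain strings
    events.sort(key=lambda e: e.get("timestamp", ""))
    return [(e.get("timestamp"), e.get("event_type"), e.get("error")) for e in events]
```

Passing an open file handle for agent_audit.log as `log_lines` works, since iterating a file yields one line at a time.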

Deployment Checklist

Before shipping any agent to production:

✅ State checkpointing: Agent survives process restart
✅ Token budgets: Runaway agent can't cost $1000
✅ Validation loop: Semantic failures trigger retry
✅ Tool fallbacks: Failed API = degraded mode, not crash
✅ Audit logging: Every decision is recorded
✅ Monitoring: Alert on failed validations or budget overages
✅ Rollback plan: How to disable agent if it misbehaves
✅ Cost tracking: Graph tokens/month to catch trends

The Future: Self-Healing Agents

These patterns keep agents alive. The next frontier is agents that fix themselves: detecting failures, adjusting their strategy, and learning what works.

We're experimenting with agents that maintain a "failure journal" — every mistake builds a knowledge base of what NOT to do. Your agent becomes smarter the longer it runs.

For now: implement state checkpointing and token budgets. Those two patterns alone prevent 80% of production agent problems.