The Claude Agent SDK makes it trivial to build agents that plan, reason, and execute. Ship a proof-of-concept in an afternoon. But production agents are different. Your agent needs to survive rate limits, API outages, and token explosions. It needs to remember what failed so it doesn't retry the same broken path. It needs to cost $0.50 per interaction, not $50.
We run a production agent system at Like One that processes 10K+ interactions monthly: grant research, proposal writing, donor communication, and compliance checks. This guide shows the patterns we use to keep it running reliably.
The Production Agent Checklist (Before You Deploy)
Most agent failures fall into three buckets:
- State loss — agent forgets what it's doing mid-task
- Cost explosion — one runaway agent blows the monthly budget
- Silent failures — agent produces wrong answer without error
Before shipping, ask:
- Can my agent retry a failed step without losing context?
- What's the worst-case token cost per interaction?
- How does my agent detect and recover from semantic failures (wrong answer, not just API errors)?
- Can I pause or cancel a running agent?
- Do I log every tool call, decision, and failure reason?
If you can't answer at least three of these, don't deploy yet.
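The pause/cancel question is worth a concrete illustration. What follows is a minimal sketch, assuming an agent that runs its steps one at a time; the names `CancellableLoop` and `run_steps` are hypothetical and not part of any SDK. A shared `threading.Event` is checked between steps, so a cancel request from another thread stops the loop at the next step boundary.

```python
import threading
from typing import Callable, List


class CancellableLoop:
    """Runs steps one at a time and stops once cancel() has been called."""

    def __init__(self) -> None:
        self._cancel = threading.Event()

    def cancel(self) -> None:
        # Safe to call from another thread, e.g. an admin endpoint.
        self._cancel.set()

    def run_steps(self, steps: List[Callable[[], str]]) -> List[str]:
        results = []
        for step in steps:
            # The flag is checked between steps; a step already running finishes.
            if self._cancel.is_set():
                break
            results.append(step())
        return results


loop = CancellableLoop()


def draft_then_cancel() -> str:
    # Stands in for an operator cancelling while the second step is running.
    loop.cancel()
    return "draft"


results = loop.run_steps([lambda: "research", draft_then_cancel, lambda: "review"])
print(results)  # ['research', 'draft']
```

The third step never runs. Checking only at step boundaries keeps the state consistent: every step either completed or never started.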
Pattern 1: State Checkpointing (Survive Restarts)
Store agent state to disk/DB after every meaningful action. If the agent crashes or the process restarts, resume from the last checkpoint instead of starting over.
import json
import sqlite3
from datetime import datetime
from enum import Enum

from anthropic import Anthropic

client = Anthropic()


class AgentPhase(Enum):
    INITIAL = "initial"
    RESEARCHING = "researching"
    DRAFTING = "drafting"
    REVIEWING = "reviewing"
    COMPLETE = "complete"
    FAILED = "failed"


class PersistentAgent:
    def __init__(self, agent_id: str, db_path: str = "agent_state.db"):
        self.agent_id = agent_id
        self.db_path = db_path
        self.messages = []
        self.phase = AgentPhase.INITIAL
        self.checkpoints = []
        self._init_db()
        self._load_or_init_state()

    def _init_db(self):
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        c.execute('''
            CREATE TABLE IF NOT EXISTS agent_state (
                agent_id TEXT PRIMARY KEY,
                phase TEXT,
                messages TEXT,
                checkpoints TEXT,
                last_checkpoint TEXT,
                created_at TEXT,
                updated_at TEXT
            )
        ''')
        conn.commit()
        conn.close()

    def _load_or_init_state(self):
        """Resume from last checkpoint if exists, else init new state."""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        c.execute('SELECT phase, messages, checkpoints FROM agent_state WHERE agent_id = ?', (self.agent_id,))
        result = c.fetchone()
        conn.close()
        if result:
            phase_str, messages_json, checkpoints_json = result
            self.phase = AgentPhase(phase_str)
            self.messages = json.loads(messages_json)
            self.checkpoints = json.loads(checkpoints_json)
            print(f"✅ Resumed agent {self.agent_id} from phase {self.phase.value}")
        else:
            # Initialize new agent
            self._save_checkpoint("initialized")

    def _save_checkpoint(self, reason: str):
        """Save current state to DB."""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        now = datetime.now().isoformat()
        checkpoint = {
            "reason": reason,
            "phase": self.phase.value,
            "message_count": len(self.messages),
            "timestamp": now
        }
        self.checkpoints.append(checkpoint)
        # Keep last 10 checkpoints
        if len(self.checkpoints) > 10:
            self.checkpoints = self.checkpoints[-10:]
        # INSERT OR REPLACE rewrites the whole row, so carry created_at over
        # from the existing row (if any) instead of leaving it NULL.
        c.execute('''
            INSERT OR REPLACE INTO agent_state
                (agent_id, phase, messages, checkpoints, last_checkpoint, created_at, updated_at)
            VALUES (?, ?, ?, ?, ?,
                    COALESCE((SELECT created_at FROM agent_state WHERE agent_id = ?), ?),
                    ?)
        ''', (
            self.agent_id,
            self.phase.value,
            json.dumps(self.messages),
            json.dumps(self.checkpoints),
            now,
            self.agent_id,
            now,
            now
        ))
        conn.commit()
        conn.close()

    def transition(self, new_phase: AgentPhase, reason: str = ""):
        """Move to next phase and checkpoint."""
        self.phase = new_phase
        self._save_checkpoint(f"phase_transition: {reason}")

    def step(self, system_prompt: str, user_input: str) -> str:
        """One reasoning step. Checkpoint after each step."""
        try:
            self.messages.append({"role": "user", "content": user_input})
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                system=system_prompt,
                messages=self.messages
            )
            assistant_message = response.content[0].text
            self.messages.append({"role": "assistant", "content": assistant_message})
            # Checkpoint success
            self._save_checkpoint(f"step_success: {user_input[:50]}")
            return assistant_message
        except Exception as e:
            # Checkpoint failure (but don't lose conversation history)
            self._save_checkpoint(f"step_failed: {str(e)[:100]}")
            raise


# Usage: Multi-step agent with recovery
agent = PersistentAgent(agent_id="research_agent_001")

# On restart, agent resumes from last phase
if agent.phase == AgentPhase.INITIAL:
    agent.transition(AgentPhase.RESEARCHING, "starting research")

result = agent.step(
    system_prompt="You are a research assistant for nonprofits. Find grants.",
    user_input="What foundation grants exist for HIV cure research?"
)
print(f"Research result: {result}")
# If process crashes here, next startup will be in RESEARCHING phase
# with all prior messages intact.
Key benefit: If your Python process crashes mid-task, the next startup resumes automatically. No lost work. No token waste on re-processing.
Cost impact: Effectively zero. Checkpointing spends no tokens, and a SQLite write takes about 1 ms. Save checkpoints after every meaningful action, not after every token.
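The resume path can be exercised without spending tokens by testing the persistence layer on its own. The sketch below is a stripped-down illustration, not the class above: `save_state`, `load_state`, and the `state` table are hypothetical names, and the database lives in a temporary directory. Wrapping the write in `with conn:` makes it a single transaction, so a crash mid-write leaves the previous checkpoint intact.

```python
import json
import os
import sqlite3
import tempfile


def _connect(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state "
        "(agent_id TEXT PRIMARY KEY, phase TEXT, messages TEXT)"
    )
    return conn


def save_state(db_path: str, agent_id: str, phase: str, messages: list) -> None:
    conn = _connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT OR REPLACE INTO state VALUES (?, ?, ?)",
                (agent_id, phase, json.dumps(messages)),
            )
    finally:
        conn.close()


def load_state(db_path: str, agent_id: str):
    """Return (phase, messages), or None if the agent has no checkpoint."""
    conn = _connect(db_path)
    try:
        row = conn.execute(
            "SELECT phase, messages FROM state WHERE agent_id = ?", (agent_id,)
        ).fetchone()
    finally:
        conn.close()
    return None if row is None else (row[0], json.loads(row[1]))


db_path = os.path.join(tempfile.mkdtemp(), "demo_state.db")
history = [{"role": "user", "content": "Find grants"}]
save_state(db_path, "agent_1", "researching", history)

# Simulate a restart: nothing survives in memory, only the file on disk.
resumed = load_state(db_path, "agent_1")
print(resumed)
```

A test like this catches serialization problems (for example, message content that is not JSON-serializable) before they show up as a failed resume in production.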
Pattern 2: Token Budget & Circuit Breaker
Track tokens consumed per agent interaction. If you're about to exceed the budget, halt the agent instead of letting it spiral.
from anthropic import Anthropic

client = Anthropic()


class BudgetedAgent:
    def __init__(self, max_tokens_per_interaction: int = 50000, model: str = "claude-opus-4-6"):
        self.max_tokens = max_tokens_per_interaction
        self.tokens_used = 0
        self.model = model
        self.messages = []

    def estimate_tokens(self, text: str) -> int:
        """Rough estimate: ~1 token per 3 characters."""
        return len(text) // 3

    def step(self, system_prompt: str, user_input: str, max_step_tokens: int = 2048) -> tuple:
        """Execute one step. Returns (response, tokens_used, budget_ok)."""
        # Estimate cost BEFORE calling API
        input_estimate = self.estimate_tokens(system_prompt + str(self.messages) + user_input)
        output_estimate = max_step_tokens
        total_estimate = input_estimate + output_estimate

        # Circuit breaker: halt if over budget
        if self.tokens_used + total_estimate > self.max_tokens:
            return (
                f"❌ Token budget exceeded. Used {self.tokens_used}/{self.max_tokens}. Stopping.",
                0,
                False
            )

        # Safe to proceed
        self.messages.append({"role": "user", "content": user_input})
        response = client.messages.create(
            model=self.model,
            max_tokens=max_step_tokens,
            system=system_prompt,
            messages=self.messages
        )
        assistant_message = response.content[0].text
        self.messages.append({"role": "assistant", "content": assistant_message})

        # Track actual usage
        actual_input = response.usage.input_tokens
        actual_output = response.usage.output_tokens
        actual_total = actual_input + actual_output
        self.tokens_used += actual_total
        return (
            assistant_message,
            actual_total,
            self.tokens_used < self.max_tokens
        )


# Usage
# The budget is a hard cap on tokens per task. The dollar ceiling it implies
# depends on the model's per-token pricing and the input/output split.
agent = BudgetedAgent(max_tokens_per_interaction=100000)
response, tokens, ok = agent.step(
    system_prompt="You are a grant researcher.",
    user_input="Find 10 AI grants with $1M+ budgets."
)
if not ok:
    print("⚠️ Budget exceeded. Halting agent.")
else:
    print(f"✅ Step used {tokens} tokens. Budget remaining: {agent.max_tokens - agent.tokens_used}")
Prevents: Runaway agents that loop endlessly. One bad prompt = one expensive mistake, not a $1000 bill.
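To pick a sensible budget, it helps to work out the worst case first. Because every request resends the full conversation, input tokens grow with each step. The sketch below does that arithmetic; `worst_case_cost` is a hypothetical helper, and the prices in the example are placeholders rather than any real model's rates, so substitute the current pricing for the model you use.

```python
def worst_case_cost(
    steps: int,
    system_tokens: int,
    user_tokens_per_step: int,
    max_output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> dict:
    """Upper bound on tokens and cost if every step uses its full output allowance."""
    total_input = 0
    total_output = 0
    history = 0  # tokens from earlier turns, resent with every request
    for _ in range(steps):
        total_input += system_tokens + history + user_tokens_per_step
        total_output += max_output_tokens
        history += user_tokens_per_step + max_output_tokens
    cost = (
        total_input * input_price_per_mtok + total_output * output_price_per_mtok
    ) / 1_000_000
    return {
        "input_tokens": total_input,
        "output_tokens": total_output,
        "cost_usd": round(cost, 4),
    }


# Placeholder prices (USD per million tokens), for illustration only.
estimate = worst_case_cost(
    steps=5,
    system_tokens=200,
    user_tokens_per_step=100,
    max_output_tokens=2048,
    input_price_per_mtok=10.0,
    output_price_per_mtok=50.0,
)
print(estimate)
# {'input_tokens': 22980, 'output_tokens': 10240, 'cost_usd': 0.7418}
```

Note how input tokens (22,980) are more than double the output tokens (10,240) after only five steps. In long conversations the resent history dominates, which is why a per-interaction cap matters more than a per-step cap.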
Pattern 3: Failure Detection & Semantic Validation
API success != correct answer. Your agent might produce confident nonsense. Detect semantic failures before they propagate.
import json

from anthropic import Anthropic

client = Anthropic()


class ValidatingAgent:
    def __init__(self):
        self.messages = []
        self.validation_failures = []

    def validate_grant_response(self, response: str) -> dict:
        """Use Claude to validate if grant response is credible."""
        validation_prompt = f"""Is this grant research output credible and complete?
Output:
{response}
Respond in JSON:
{{
"is_valid": true/false,
"issues": ["issue1", "issue2"],
"confidence": 0.0-1.0,
"suggestion": "what to do next"
}}
Be strict. Empty lists = red flag. Missing eligibility info = red flag."""
        validation = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=500,
            messages=[{"role": "user", "content": validation_prompt}]
        ).content[0].text
        try:
            return json.loads(validation)
        except json.JSONDecodeError:
            # Return every key the caller reads, so a malformed validator
            # response counts as a failed validation instead of a KeyError.
            return {
                "is_valid": False,
                "issues": ["Validation response malformed"],
                "confidence": 0.0,
                "suggestion": ""
            }

    def research_grants(self, query: str) -> dict:
        """Research grants with validation loop."""
        self.messages.append({"role": "user", "content": query})
        max_retries = 3
        for attempt in range(max_retries):
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                system="You are a grant researcher. Return structured grant data.",
                messages=self.messages
            ).content[0].text
            self.messages.append({"role": "assistant", "content": response})

            # Validate (use .get in case the validator omits a key)
            validation = self.validate_grant_response(response)
            if validation.get("is_valid") and validation.get("confidence", 0) > 0.8:
                return {
                    "status": "success",
                    "response": response,
                    "validation": validation,
                    "attempts": attempt + 1
                }

            # Semantic failure detected
            issues = validation.get("issues", [])
            suggestion = validation.get("suggestion", "")
            self.validation_failures.append({
                "attempt": attempt,
                "issues": issues,
                "suggestion": suggestion
            })

            # Retry with feedback
            if attempt < max_retries - 1:
                retry_prompt = f"""Your previous response had issues:\n
{json.dumps(issues, indent=2)}
Try again. Be thorough. {suggestion}"""
                self.messages.append({"role": "user", "content": retry_prompt})

        # All retries failed
        return {
            "status": "failed_validation",
            "failures": self.validation_failures,
            "last_response": response
        }


# Usage
agent = ValidatingAgent()
result = agent.research_grants(
    "Find 5 grants for AI nonprofits under $50K with short turnaround (30 days)."
)
if result["status"] == "success":
    print(f"✅ Grant research complete ({result['attempts']} attempts)")
else:
    print(f"❌ Research failed validation: {result['failures']}")
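Catches: Responses where the agent returns empty lists, misses eligibility requirements, or names grants that look implausible. Failed checks trigger an automatic retry with the validator's feedback. The validator is itself a model call with no access to a grants database, so it can flag suspicious grant names but cannot confirm that a grant actually exists.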
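One practical wrinkle with model-based validation is that the validator may wrap its JSON in prose or a code fence, which makes a plain `json.loads` fail. A minimal sketch of a more tolerant parser follows; `parse_model_json` is a hypothetical helper, and it assumes the reply contains exactly one JSON object. It extracts the outermost `{...}` span and fills in any missing keys from a default.

```python
import json
import re


def parse_model_json(text: str, default: dict) -> dict:
    """Parse one JSON object out of model output, tolerating surrounding text."""
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return dict(default)
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(default)
    # Keys the model omitted fall back to the default values.
    return {**default, **parsed}


DEFAULT_VALIDATION = {
    "is_valid": False,
    "issues": ["Validation response malformed"],
    "confidence": 0.0,
    "suggestion": "",
}

reply = 'Here is my assessment:\n{"is_valid": false, "issues": ["empty list"]}\nHope this helps.'
parsed = parse_model_json(reply, DEFAULT_VALIDATION)
print(parsed)

garbage = parse_model_json("I could not evaluate this.", DEFAULT_VALIDATION)
print(garbage["issues"])
```

Defaulting to `is_valid: False` is the conservative choice: an unreadable verdict is treated as a failed validation, never as a pass.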
Pattern 4: Tool Failure Recovery
When your agent calls a tool (database query, API, file write), that tool can fail. Have a fallback.
import json

from anthropic import Anthropic

client = Anthropic()


class ResilientToolAgent:
    def __init__(self):
        self.messages = []
        self.tools = [
            {
                "name": "search_grants",
                "description": "Search foundation grants database",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "keyword": {"type": "string"},
                        "max_budget": {"type": "number"}
                    }
                }
            }
        ]

    def execute_tool(self, tool_name: str, tool_input: dict) -> dict:
        """Execute tool, falling back to cached data if the primary source fails."""
        if tool_name == "search_grants":
            keyword = tool_input.get("keyword")
            max_budget = tool_input.get("max_budget", 1000000)

            # Try primary database
            try:
                results = self._query_grants_db(keyword, max_budget)
                if results:
                    return {"status": "success", "results": results}
            except Exception as e:
                print(f"⚠️ Primary DB failed: {e}. Trying fallback...")

            # Fallback: cache or alternative source
            try:
                results = self._query_grants_cache(keyword)
                return {
                    "status": "partial",
                    "results": results,
                    "note": "Using cached data. Results may be stale."
                }
            except Exception:
                pass

            # Last resort: return empty but don't crash
            return {
                "status": "failed",
                "results": [],
                "error": "Both primary and fallback failed. No grants found."
            }

        # Unknown tool: report it to the model instead of returning None
        return {"status": "failed", "results": [], "error": f"Unknown tool: {tool_name}"}

    def _query_grants_db(self, keyword, max_budget):
        # Simulated DB query
        raise Exception("Database connection timeout")

    def _query_grants_cache(self, keyword):
        # Fallback to cached data
        return [{"name": "AWS Imagine Grant", "budget": 50000}]

    def step_with_tools(self, user_input: str) -> str:
        """Execute agent step, handling tool failures."""
        self.messages.append({"role": "user", "content": user_input})
        while True:
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                system="You are a grant researcher with access to search_grants tool.",
                tools=self.tools,
                messages=self.messages
            )

            # Check if agent is done (no tool calls)
            if not any(block.type == "tool_use" for block in response.content):
                text = next(
                    (block.text for block in response.content if hasattr(block, "text")),
                    "No response"
                )
                self.messages.append({"role": "assistant", "content": response.content})
                return text

            # Handle tool calls: record the assistant turn once, then send all
            # tool results back together in a single user message.
            self.messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_result = self.execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(tool_result)
                    })
            self.messages.append({"role": "user", "content": tool_results})


# Usage
agent = ResilientToolAgent()
result = agent.step_with_tools("Find AI grants under $50K")
print(result)
Prevents: One failed API call from breaking your entire agent. Fallbacks let the agent complete the task, even if in degraded mode.
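Fallbacks handle a source that is down; transient errors such as timeouts and rate limits are often better handled by retrying the same call after a short wait. The sketch below shows exponential backoff with jitter. `retry_with_backoff` is a hypothetical helper, not part of any SDK, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) plus up to 10% jitter
            # so that many agents do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")


calls = {"count": 0}


def flaky_query():
    # Simulated tool: fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Database connection timeout")
    return [{"name": "Example Grant", "budget": 50000}]


delays = []
results = retry_with_backoff(flaky_query, sleep=delays.append)
print(results, calls["count"], len(delays))
```

In a real agent this would wrap the primary query, with the cache fallback used only after the retries are exhausted. Retrying every exception type is a simplification; in practice, retry only errors known to be transient.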
Pattern 5: Observability — Logging Every Decision
If an agent misbehaves in production, you need to know exactly what it did. Log every API call, tool invocation, and decision point.
import json
import logging
from datetime import datetime

from anthropic import Anthropic

client = Anthropic()

# Structured logging for agents: each log line is one JSON event.
logger = logging.getLogger("agent")
# Without an explicit level the logger inherits WARNING and drops info() calls.
logger.setLevel(logging.INFO)
handler = logging.FileHandler("agent_audit.log")
# log_event already emits JSON with its own timestamp, so write the message as-is.
# Embedding it in a JSON template would nest unescaped quotes and break parsing.
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)


class ObservableAgent:
    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.messages = []
        self.session_id = datetime.now().isoformat()

    def log_event(self, event_type: str, details: dict):
        """Log agent action."""
        event = {
            "session_id": self.session_id,
            "agent_id": self.agent_id,
            "event_type": event_type,
            "timestamp": datetime.now().isoformat(),
            **details
        }
        logger.info(json.dumps(event))

    def step(self, system_prompt: str, user_input: str) -> str:
        """Step with full audit trail."""
        self.log_event("step_start", {"user_input": user_input[:100]})
        self.messages.append({"role": "user", "content": user_input})
        try:
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                system=system_prompt,
                messages=self.messages
            )
            assistant_message = response.content[0].text
            self.messages.append({"role": "assistant", "content": assistant_message})

            # Log API usage
            self.log_event("api_call", {
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "model": "claude-opus-4-6",
                "response_length": len(assistant_message)
            })
            return assistant_message
        except Exception as e:
            self.log_event("step_failed", {"error": str(e), "error_type": type(e).__name__})
            raise


# Usage
agent = ObservableAgent(agent_id="research_agent_001")
agent.step(
    system_prompt="You research grants.",
    user_input="What AI grants exist?"
)
# Log file now contains:
# {"session_id": "2026-06-27T...", "agent_id": "research_agent_001", "event_type": "step_start", ...}
# {"session_id": "2026-06-27T...", "agent_id": "research_agent_001", "event_type": "api_call", ...}
For debugging: Parse agent_audit.log to understand exactly what happened. Which tool failed? When? With what input?
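As a starting point for that kind of analysis, here is a minimal sketch of a log summarizer. It assumes each line of the log is a single JSON object with the fields `log_event` writes (`event_type`, `input_tokens`, `output_tokens`, `error_type`); `summarize_audit_log` is a hypothetical helper, and the sample file is written to a temporary directory so the example runs on its own.

```python
import json
import os
import tempfile
from collections import Counter


def summarize_audit_log(path: str) -> dict:
    """Count events, total tokens, and failure types in a JSON-lines audit log."""
    event_counts = Counter()
    total_tokens = 0
    failures = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not JSON events
            if not isinstance(event, dict):
                continue
            event_type = event.get("event_type", "unknown")
            event_counts[event_type] += 1
            total_tokens += event.get("input_tokens", 0) + event.get("output_tokens", 0)
            if event_type == "step_failed":
                failures.append(event.get("error_type", "unknown"))
    return {"events": dict(event_counts), "total_tokens": total_tokens, "failures": failures}


# Sample log with made-up events, for illustration only.
sample_events = [
    {"event_type": "step_start", "user_input": "What AI grants exist?"},
    {"event_type": "api_call", "input_tokens": 120, "output_tokens": 480},
    {"event_type": "step_start", "user_input": "Narrow to under $50K"},
    {"event_type": "step_failed", "error": "rate limited", "error_type": "RateLimitError"},
]
log_path = os.path.join(tempfile.mkdtemp(), "agent_audit.log")
with open(log_path, "w", encoding="utf-8") as f:
    for event in sample_events:
        f.write(json.dumps(event) + "\n")

summary = summarize_audit_log(log_path)
print(summary)
```

The same loop extends naturally to grouping by `session_id` or by month, which covers the cost-tracking and alerting needs without extra infrastructure.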
Deployment Checklist
Before shipping any agent to production:
- ✅ State checkpointing: Agent survives process restart
- ✅ Token budgets: Runaway agent can't cost $1000
- ✅ Validation loop: Semantic failures trigger retry
- ✅ Tool fallbacks: Failed API = degraded mode, not crash
- ✅ Audit logging: Every decision is recorded
- ✅ Monitoring: Alert on failed validations or budget overages
- ✅ Rollback plan: How to disable agent if it misbehaves
- ✅ Cost tracking: Graph tokens/month to catch trends
The Future: Self-Healing Agents
These patterns keep agents alive. The next frontier is agents that fix themselves: detecting failures, adjusting their strategy, and learning what works.
We're experimenting with agents that maintain a "failure journal" — every mistake builds a knowledge base of what NOT to do. Your agent becomes smarter the longer it runs.
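To make the idea concrete, here is one way a failure journal could look. This is an illustrative sketch only, not the implementation described above: `FailureJournal` and its methods are hypothetical, and the journal lives in memory, where a real one would be persisted alongside the agent state. Each failed action is keyed by a hash of its name and parameters, so the agent can check whether it is about to repeat a known-bad attempt.

```python
import hashlib
import json
from typing import Optional


class FailureJournal:
    """Remembers which (action, params) combinations have already failed."""

    def __init__(self) -> None:
        self._entries = {}

    @staticmethod
    def _key(action: str, params: dict) -> str:
        # sort_keys makes the key independent of parameter order.
        payload = json.dumps({"action": action, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, action: str, params: dict, reason: str) -> None:
        self._entries[self._key(action, params)] = reason

    def known_failure(self, action: str, params: dict) -> Optional[str]:
        """Return the recorded failure reason, or None if this attempt is new."""
        return self._entries.get(self._key(action, params))


journal = FailureJournal()
journal.record(
    "search_grants",
    {"keyword": "AI", "max_budget": 50000},
    "Database connection timeout",
)

# Same call with parameters in a different order: recognized as a repeat.
repeat = journal.known_failure("search_grants", {"max_budget": 50000, "keyword": "AI"})
# Different parameters: no recorded failure, so the agent may try it.
fresh = journal.known_failure("search_grants", {"keyword": "health", "max_budget": 50000})
print(repeat, fresh)  # Database connection timeout None
```

Exact-match keys are the simplest possible policy. Entries for transient failures such as timeouts would also need to expire, or the agent will avoid paths that have since recovered.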
For now: implement state checkpointing and token budgets. Those two patterns alone prevent 80% of production agent problems.