Evaluating Your Agent
An agent that works in a demo does not always work in production. Here are the five dimensions you must measure, the code to measure them, and the thresholds that separate "deployable" from "dangerous."
The Five Evaluation Dimensions
Rate your agent on each dimension on a 0-100 scale. An agent must score above 70 on ALL five to be production-ready; one weak dimension can sink the entire experience, as the sketch after this list shows:

1. Accuracy: How often does the agent give correct, useful responses? Measure by running a test suite of known questions with expected answers.

2. Latency: How quickly does the agent complete tasks? Users expect responses within 5-10 seconds for simple queries and within 30 seconds for complex multi-tool tasks.

3. Reliability: Does the agent work consistently, without crashes or silent failures? Measure the error rate over 100+ runs.

4. Cost: How much does each agent interaction cost? At scale, a $0.50 interaction that should cost $0.05 will kill your budget.

5. User satisfaction: Are users happy? An agent can be technically correct but still frustrating if the tone is wrong, the format is confusing, or it does not explain its limitations.
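The 70-point gate and the cost math are both easy to encode. A minimal sketch, with hypothetical dimension scores and a hypothetical volume of 100,000 interactions per month:

# Hypothetical dimension scores on a 0-100 scale
scores = {
    "accuracy": 82,
    "latency": 75,
    "reliability": 91,
    "cost": 64,            # weak: each interaction costs too much
    "satisfaction": 78,
}

# Production-ready only if ALL five clear the bar
print(all(score > 70 for score in scores.values()))  # False: cost alone sinks it

# Why cost matters at scale: 100,000 interactions/month (hypothetical volume)
print(f"${0.50 * 100_000:,.0f} vs ${0.05 * 100_000:,.0f} per month")  # $50,000 vs $5,000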
Building a Test Suite
You cannot evaluate what you do not measure. Here is how to build an automated test suite for your agent:
import time

TEST_CASES = [
    {
        "input": "What plan is jane@acme.co on?",
        "expected_tool": "lookup_customer",
        "expected_contains": ["Pro", "$49"],
    },
    {
        "input": "How do I reset my password?",
        "expected_tool": "search_knowledge_base",
        "expected_contains": ["settings", "reset"],
    },
    {
        "input": "I need help",        # Ambiguous
        "expected_tool": None,         # Should ask for clarification
        "expected_contains": ["what", "help"],
    },
]

def evaluate_agent(agent_fn):
    """Run every test case through agent_fn; report accuracy, latency, and errors.

    Note: expected_tool is not verified here; the sketch after this block
    shows one way to check it.
    """
    results = []
    for case in TEST_CASES:
        start = time.time()
        try:
            response = agent_fn(case["input"])
            elapsed = time.time() - start
            results.append({
                "input": case["input"],
                # Pass only if every expected keyword appears (case-insensitive)
                "passed": all(kw.lower() in response.lower()
                              for kw in case["expected_contains"]),
                "time_s": round(elapsed, 2),
                "error": None,
            })
        except Exception as e:
            # A crash counts as a failure and an error; no timing is recorded
            results.append({
                "input": case["input"],
                "passed": False,
                "error": str(e),
            })

    # Calculate scores; average latency only over runs that completed
    timed = [r["time_s"] for r in results if "time_s" in r]
    accuracy = sum(r["passed"] for r in results) / len(results)
    avg_time = sum(timed) / len(timed) if timed else 0.0
    errors = sum(1 for r in results if r["error"])

    print(f"Accuracy: {accuracy*100:.0f}%")
    print(f"Avg time: {avg_time:.1f}s")
    print(f"Errors: {errors}/{len(results)}")
    return results