Evaluating Your Agent

An agent that works in a demo does not always work in production. Here are the five dimensions you must measure, the code to measure them, and the thresholds that separate "deployable" from "dangerous."

The Five Evaluation Dimensions

Rate your agent from 0 to 100 on each dimension. An agent must score above 70 on ALL five to be production-ready; one weak dimension can sink the entire experience:

1. Accuracy

How often does the agent give correct, useful responses? Measure by running a test suite of known questions with expected answers.

Improve: Better system prompts, few-shot examples, output validation, tool result verification.
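
A small but effective accuracy guard is validating the agent's output before it reaches the user. A minimal sketch, assuming the agent is instructed to reply with a JSON object containing "answer" and "sources" fields (both field names are illustrative):

# validate.py: reject malformed agent output before it reaches the user
import json

REQUIRED_FIELDS = {"answer", "sources"}  # illustrative schema; adapt to your agent

def validate_response(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Agent returned non-JSON output: {e}")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Agent response is missing fields: {missing}")
    return data

On a validation failure you can retry the call with the error message appended to the prompt, instead of surfacing broken output.
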
2. Speed

How quickly does the agent complete tasks? Users expect responses within 5-10 seconds for simple queries, 30 seconds for complex multi-tool tasks.

Improve: Caching frequent queries, parallel tool calls, routing simple tasks to faster models.
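
Caching is usually the quickest win. A minimal sketch of an exact-match cache keyed on the normalized query (production systems often use embedding similarity to catch near-duplicates); the TTL value is illustrative:

# cache.py: serve repeated queries without calling the model
import time

_cache: dict[str, tuple[str, float]] = {}
CACHE_TTL_S = 3600  # illustrative: expire cached answers after one hour

def cached_agent(query: str, agent_fn) -> str:
    key = query.strip().lower()
    hit = _cache.get(key)
    if hit and time.time() - hit[1] < CACHE_TTL_S:
        return hit[0]                        # cache hit: milliseconds, zero tokens
    answer = agent_fn(query)                 # cache miss: call the real agent
    _cache[key] = (answer, time.time())
    return answer
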
3. Reliability

Does the agent work consistently without crashes or silent failures? Measure error rate over 100+ runs.

Improve: Retry logic, fallback tools, comprehensive error handling, dead letter queues.
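
Transient failures such as rate limits and timeouts usually disappear on a retry. A minimal sketch of retries with exponential backoff followed by a fallback; primary_tool and fallback_tool stand in for your own callables:

# retry.py: exponential backoff, then a degraded fallback
import time

def call_with_retry(primary_tool, fallback_tool, *args, retries=3):
    for attempt in range(retries):
        try:
            return primary_tool(*args)
        except Exception as exc:
            wait = 2 ** attempt              # 1s, 2s, 4s
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    return fallback_tool(*args)              # e.g. a cached answer or a human handoff
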
4. Cost Efficiency

How much does each agent interaction cost? At scale, a $0.50 interaction that should cost $0.05 will kill your budget.

Improve: Caching, token limits, tiered model routing (fast model for simple tasks, powerful model for complex).
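
Tiered routing keeps the expensive model for the queries that actually need it. A minimal sketch, assuming fast_model and strong_model are callables you provide; the length-and-keyword heuristic (real routers often use a small classifier) and the per-token prices are illustrative:

# router.py: send simple queries to the cheap model
def route(query: str, fast_model, strong_model) -> str:
    simple = len(query) < 200 and not any(
        word in query.lower() for word in ("compare", "analyze", "step by step"))
    return fast_model(query) if simple else strong_model(query)

# Cost per interaction = input_tokens * input_price + output_tokens * output_price.
# e.g. 2,000 input + 500 output tokens at $3 / $15 per million tokens (illustrative):
cost = 2_000 * 3 / 1_000_000 + 500 * 15 / 1_000_000   # ~ $0.0135
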
5. User Satisfaction

Are users happy? An agent can be technically correct but still frustrating if the tone is wrong, the format is confusing, or it does not explain its limitations.

Improve: Feedback collection, tone adjustments, clearer output formatting, transparency about limitations.
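
The simplest satisfaction signal is a thumbs up/down on every response. A minimal sketch of collecting it and computing a running rate (an in-memory list stands in for your database):

# feedback.py: collect thumbs up/down per interaction
feedback_log: list[dict] = []

def record_feedback(interaction_id: str, thumbs_up: bool, comment: str = "") -> None:
    feedback_log.append({"id": interaction_id, "up": thumbs_up, "comment": comment})

def satisfaction_rate() -> float:
    if not feedback_log:
        return 0.0
    return sum(f["up"] for f in feedback_log) / len(feedback_log)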

Building a Test Suite

You cannot evaluate what you do not measure. Here is how to build an automated test suite for your agent:

# eval.py — Agent evaluation framework
import time

TEST_CASES = [
    {
        "input": "What plan is jane@acme.co on?",
        "expected_tool": "lookup_customer",
        "expected_contains": ["Pro", "$49"],
    },
    {
        "input": "How do I reset my password?",
        "expected_tool": "search_knowledge_base",
        "expected_contains": ["settings", "reset"],
    },
    {
        "input": "I need help",        # Ambiguous
        "expected_tool": None,         # Should ask for clarification
        "expected_contains": ["what", "help"],
    },
]

def evaluate_agent(agent_fn):
    # Note: expected_tool is not checked in this minimal harness; extend it
    # if your agent_fn also reports which tool it called.
    results = []
    for case in TEST_CASES:
        start = time.time()
        try:
            response = agent_fn(case["input"])
            elapsed = time.time() - start
            results.append({
                "input": case["input"],
                "passed": all(kw.lower() in response.lower()
                              for kw in case["expected_contains"]),
                "time_s": round(elapsed, 2),
                "error": None,
            })
        except Exception as e:
            results.append({
                "input": case["input"],
                "passed": False,
                "error": str(e),
            })

    # Calculate scores (latency is averaged only over runs that completed)
    accuracy = sum(r["passed"] for r in results) / len(results)
    timed = [r["time_s"] for r in results if "time_s" in r]
    avg_time = sum(timed) / len(timed) if timed else 0.0
    errors = sum(1 for r in results if r["error"])

    print(f"Accuracy: {accuracy*100:.0f}%")
    print(f"Avg time: {avg_time:.1f}s")
    print(f"Errors: {errors}/{len(results)}")
    return results

Common Evaluation Traps

Testing only the happy path — Your test suite only has clear, well-formed queries. Add ambiguous inputs, edge cases, and adversarial prompts. Real users are messy.
Ignoring speed — An agent that takes 45 seconds per response will frustrate users even if every answer is perfect. Measure latency on every test case.
Measuring once — LLM outputs are non-deterministic. Run each test case 3-5 times and measure the spread. An agent that passes 3/5 times is 60% reliable, not 100%.
No production monitoring — Pre-launch testing is not enough. Log every production interaction and review failure cases weekly. Accuracy drifts over time as user patterns change.
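
For that last point, the lightest-weight form of production monitoring is an append-only log of every interaction that you can search during a weekly review. A minimal sketch writing JSON lines to a local file (swap in your real logging pipeline in practice; the path is illustrative):

# monitor.py: append every production interaction to a JSONL log
import json
import time

LOG_PATH = "agent_interactions.jsonl"   # illustrative path

def log_interaction(user_input: str, response: str, latency_s: float, error: str | None = None) -> None:
    record = {
        "ts": time.time(),
        "input": user_input,
        "response": response,
        "latency_s": round(latency_s, 2),
        "error": error,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")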