Designing Agentic Loops That Don't Break

Lesson Content

After this lesson you'll know

  • The anatomy of an agentic loop (observe → think → act → evaluate)
  • Five stopping conditions that prevent infinite loops
  • Error recovery patterns the exam tests
  • How to design loops that scale from 1 to 1000 iterations

The Agentic Loop

Every agent runs in a loop. The loop is the heartbeat of agentic architecture. Get it wrong and your agent either stops too early (incomplete work) or runs forever (infinite loop, burned tokens, broken systems).

The CCA exam tests whether you can design loops with appropriate stopping conditions, error recovery, and resource awareness.

THE AGENTIC LOOP:

        ┌──────────┐      ┌──────────┐
  ┌────▶│ OBSERVE  │─────▶│  THINK   │
  │     │ (tools)  │      │ (reason) │
  │     └──────────┘      └────┬─────┘
  │                            │
  │                            ▼
  │     ┌──────────┐      ┌──────────┐
  │     │ EVALUATE │◀─────│   ACT    │
  │     │ (check)  │      │ (tools)  │
  │     └────┬─────┘      └──────────┘
  │          │
  │          ▼
  │     [CONTINUE?] ──── no ────▶ [DONE]
  │          │
  │         yes
  │          │
  └──────────┘
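
The four phases in the diagram map onto a small driver function. The following is a minimal sketch rather than a reference implementation: `LoopState`, `run_loop`, and the four callables passed into it are hypothetical names chosen for illustration, and the iteration cap is included only so the sketch cannot run forever.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopState:
    """Hypothetical container for what the agent has seen so far."""
    observations: list = field(default_factory=list)
    done: bool = False


def run_loop(
    observe: Callable[[], str],
    think: Callable[[LoopState], str],
    act: Callable[[str], str],
    evaluate: Callable[[str], bool],
    max_iterations: int = 10,
) -> LoopState:
    """Run observe -> think -> act -> evaluate until done or the cap is hit."""
    state = LoopState()
    for _ in range(max_iterations):
        state.observations.append(observe())  # OBSERVE: gather input
        plan = think(state)                   # THINK: choose the next action
        outcome = act(plan)                   # ACT: execute it
        state.done = evaluate(outcome)        # EVALUATE: is the goal met?
        if state.done:
            break                             # CONTINUE? no -> DONE
    return state
```

Passing the phases in as callables keeps the control flow (when to continue, when to stop) separate from what each phase does, which makes the stopping logic easy to test on its own.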

Five Stopping Conditions

The exam will test whether you know when an agent should STOP. These are the five canonical stopping conditions:

1. Goal Achieved

The agent completed its objective. This is the happy path.

# Goal: "Fix the failing test"
# Stop when: test passes
result = run_tests()
if result.all_passed:
    return "Done: all tests passing"

2. Maximum Iterations

Safety valve. Prevents infinite loops regardless of goal state.

# Hard limit: never exceed N iterations
MAX_ITERATIONS = 25
for i in range(MAX_ITERATIONS):
    result = agent_step()
    if result.done:
        return result
return "Reached iteration limit: reporting partial progress"

Exam tip: The exam often presents scenarios where the "right" answer includes a maximum iteration limit even when other stopping conditions exist. Defense in depth.

3. Error Threshold

Stop after N consecutive failures. Don't burn resources on a broken state.

# 3-strike rule
consecutive_errors = 0
MAX_ERRORS = 3
done = False

while not done:
    try:
        result = agent_step()
        consecutive_errors = 0  # reset on success
        done = result.done      # leave the loop once the goal is reached
    except Exception as e:
        consecutive_errors += 1
        if consecutive_errors >= MAX_ERRORS:
            return f"Stopping: {MAX_ERRORS} consecutive errors. Last: {e}"

4. Resource Budget

Token limits, time limits, cost limits. The agent is aware of its budget.

# Token-aware agent
token_budget = 100_000
tokens_used = 0
done = False

while tokens_used < token_budget and not done:
    result = agent_step()
    tokens_used += result.tokens_consumed
    done = result.done  # stop early if the goal is reached

if not done:
    return "Budget exhausted: partial results available"

5. Human Intervention Required

The agent recognizes it needs human input and pauses gracefully.

# Escalation: when the agent can't decide
if confidence < 0.5 or action_is_destructive:
    return "Pausing: need human approval for this action"

Production pattern: Real systems use ALL FIVE simultaneously. Goal achieved is the happy path. The others are safety nets. The CCA exam rewards answers that include multiple stopping conditions.
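
To see how the five conditions can coexist in one loop, here is a minimal sketch. `StepResult`, `run_agent`, and the returned reason strings are hypothetical names invented for this illustration; the default limits simply reuse the numbers from the snippets above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    """Hypothetical result of one agent step."""
    done: bool = False
    tokens_consumed: int = 0
    needs_human: bool = False


def run_agent(
    agent_step: Callable[[], StepResult],
    max_iterations: int = 25,
    max_errors: int = 3,
    token_budget: int = 100_000,
) -> str:
    """Run the loop and return the reason it stopped."""
    consecutive_errors = 0
    tokens_used = 0
    for _ in range(max_iterations):               # 2. maximum iterations
        if tokens_used >= token_budget:           # 4. resource budget
            return "budget_exhausted"
        try:
            result = agent_step()
        except Exception:
            consecutive_errors += 1
            if consecutive_errors >= max_errors:  # 3. error threshold
                return "error_threshold"
            continue
        consecutive_errors = 0                    # reset on success
        tokens_used += result.tokens_consumed
        if result.needs_human:                    # 5. human intervention
            return "needs_human"
        if result.done:                           # 1. goal achieved
            return "goal_achieved"
    return "max_iterations"
```

Returning an explicit reason, rather than just stopping, lets the caller tell a finished run apart from one that was cut short by a safety net.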

Error Recovery Patterns

The exam tests three error recovery patterns:

Pattern 1: Retry with Backoff

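import asyncio

# TransientError and MaxRetriesExceeded are application-defined exceptions
async def retry_with_backoff(fn, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await fn()
        except TransientError:
            if attempt == max_retries - 1:
                break  # no point waiting after the final attempt
            wait = 2 ** attempt  # 1s, then 2s, between the three attempts
            await asyncio.sleep(wait)
    raise MaxRetriesExceeded()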

When to use: Transient failures (network timeouts, rate limits, temporary API errors).

Pattern 2: Fallback Chain

async def with_fallback(primary_fn, fallback_fn):
    try:
        return await primary_fn()
    except Exception:
        return await fallback_fn()

# Example: try Claude Opus, fall back to Sonnet
result = await with_fallback(
    lambda: call_opus(prompt),
    lambda: call_sonnet(prompt)
)

When to use: When alternative approaches exist (different models, different tools, cached results).
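
When more than one alternative exists, the two-function version generalizes to an ordered list. This is a sketch under assumptions: `with_fallback_chain` is a hypothetical helper, and the options are assumed to be zero-argument async callables that return strings.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence


async def with_fallback_chain(fns: Sequence[Callable[[], Awaitable[str]]]) -> str:
    """Try each async callable in order; return the first result that succeeds."""
    if not fns:
        raise ValueError("fallback chain needs at least one option")
    last_error: Optional[Exception] = None
    for fn in fns:
        try:
            return await fn()
        except Exception as exc:  # remember why this option failed, then move on
            last_error = exc
    # Every option failed: surface the last cause instead of hiding it
    raise RuntimeError("all fallbacks failed") from last_error
```

Chaining the final error with `from last_error` keeps the underlying cause visible in the traceback, so exhausting the chain does not turn into a silently swallowed failure.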

Pattern 3: Graceful Degradation

# If brain search fails, fall back to basic search
if brain.is_available:
    results = brain.hybrid_search(query)
else:
    results = basic_string_search(query)  # degraded but functional

When to use: When partial results are better than no results. The system continues with reduced capability.

Exam pattern: "The MCP server is returning errors. What should the agent do?" The answer is almost always graceful degradation or retry with backoff. Never "crash" or "ignore the error."
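
The two recommended answers combine naturally: retry a few times, then degrade. The sketch below is one possible arrangement, not a prescribed one. `call_with_recovery` and this local `TransientError` class are hypothetical, it is written synchronously for brevity, and the `sleep` parameter exists only so the waiting can be swapped out in tests.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_with_recovery(
    primary: Callable[[], T],
    degraded: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, bool]:
    """Return (result, was_degraded). Non-transient errors propagate unchanged."""
    for attempt in range(max_retries):
        try:
            return primary(), False
        except TransientError:
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff between attempts
    # Retries exhausted: fall back, and tell the caller the result is degraded
    return degraded(), True
```

Returning the `was_degraded` flag alongside the result means the degradation is reported rather than hidden, which is the difference between graceful degradation and ignoring the error.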

Loop Architecture Patterns

Simple Sequential Loop

# Best for: clear step-by-step tasks
steps = [read_file, analyze, write_fix, run_tests]
for step in steps:
    result = step()
    if result.error:
        return handle_error(result)

Iterative Refinement Loop

# Best for: quality improvement (code review, writing, optimization)
MAX_ITERATIONS = 5
for i in range(MAX_ITERATIONS):
    output = generate(prompt, context)
    evaluation = evaluate(output, criteria)
    if evaluation.meets_criteria:
        return output
    context = update_context(context, evaluation.feedback)

Event-Driven Loop

# Best for: monitoring, long-running agents
while running:
    event = await wait_for_event(timeout=60)
    if event:
        result = process_event(event)
        if result.requires_action:
            await take_action(result)

Map-Reduce Loop (Parallel)

# Best for: processing multiple items independently
items = get_work_items()
results = await asyncio.gather(*[
    process_item(item) for item in items
])
summary = reduce(results)
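
When the item count grows large, an unbounded `asyncio.gather` launches everything at once, and a single exception discards the other results. The following sketch shows one way to bound concurrency and collect failures. `map_reduce` and its summary format are hypothetical, and items are assumed to be integers purely to keep the reduce step trivial.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def map_reduce(
    items: Iterable[int],
    process_item: Callable[[int], Awaitable[int]],
    max_concurrency: int = 10,
) -> dict:
    """Process items in parallel, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: int):
        async with semaphore:  # cap the number of in-flight tasks
            return await process_item(item)

    # return_exceptions=True keeps one failure from discarding the other results
    outcomes = await asyncio.gather(
        *(guarded(item) for item in items), return_exceptions=True
    )
    successes = [o for o in outcomes if not isinstance(o, Exception)]
    failures = [o for o in outcomes if isinstance(o, Exception)]
    # Reduce step: sum the successes and report how many items failed
    return {"total": sum(successes), "failed": len(failures)}
```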

Anti-Patterns the Exam Tests

The exam presents anti-patterns as wrong answers. Know them:

Anti-Pattern                  | Why It's Wrong                         | Correct Pattern
------------------------------+----------------------------------------+------------------------------------
No stopping condition         | Infinite loop risk                     | Always set max iterations
Retry forever                 | Wastes resources on permanent failures | Retry with limit + backoff
Swallow errors silently       | Hides bugs, produces wrong output      | Log + escalate + degrade gracefully
Restart from scratch on error | Loses progress, wastes tokens          | Resume from last good state
No progress tracking          | Can't tell if loop is stuck            | Track iterations + compare states
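
"Track iterations + compare states" can be made concrete by fingerprinting the state after each step and counting how many consecutive steps leave it unchanged. This is a sketch under assumptions: `fingerprint` and `run_until_stuck` are hypothetical helpers, the state is assumed to be a JSON-serializable dict, and a `"done"` key is assumed to signal completion.

```python
import hashlib
import json
from typing import Any, Callable, Tuple


def fingerprint(state: Any) -> str:
    """Stable hash of a JSON-serializable state snapshot."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_until_stuck(
    step: Callable[[dict], dict],
    state: dict,
    max_iterations: int = 50,
    max_stalled: int = 3,
) -> Tuple[dict, str]:
    """Return (final_state, reason), stopping early if the state stops changing."""
    stalled = 0
    last = fingerprint(state)
    for _ in range(max_iterations):
        state = step(state)
        if state.get("done"):
            return state, "goal_achieved"
        current = fingerprint(state)
        # Count consecutive steps that produced an identical state
        stalled = stalled + 1 if current == last else 0
        if stalled >= max_stalled:
            return state, "stuck"
        last = current
    return state, "max_iterations"
```

With this in place, an agent that repeats the same unproductive step is stopped after `max_stalled` iterations instead of running until the iteration cap.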

Real-World Example: Our Production Loop

Here's an actual agentic loop from our production system (the twin that built this course):

# Claude Code session loop (simplified)
MAX_TOOL_CALLS = 300
tool_calls = 0

while tool_calls < MAX_TOOL_CALLS:
    # OBSERVE: read brain for next task
    next_task = brain.read("session.next_steps")

    # THINK: plan approach
    plan = reason_about(next_task)

    # ACT: execute via tools
    result = execute_tools(plan)
    tool_calls += result.tool_count

    # EVALUATE: did it work?
    if result.success:
        brain.write("session.active_work", result.summary)
        # Check if more work exists
        if no_more_work():
            break  # Goal achieved
    else:
        # Error recovery
        if result.retryable:
            continue  # retry
        else:
            brain.write("blocker", result.error)
            break  # escalate

# Stopping conditions: goal achieved OR max tools OR non-retryable error
brain.write("session.next_steps", plan_next_session())

This loop covers all five stopping conditions, some more directly than others: goal (no more work), iterations (MAX_TOOL_CALLS), errors (a non-retryable error ends the loop), budget (implicit, since capping tool calls also bounds token usage), and human escalation (blockers written to brain).

Quick Check

1. An agent has been running for 50 iterations without making progress. The goal is not achieved. What should happen?

2. Which error recovery pattern is best for API rate limit errors?