Error Recovery & Resilience
Screens change. Pages break. Popups ambush. Your agent needs to survive all of it.
The difference between a demo and a production agent is how it handles failure. This lesson teaches retry strategies, fallback actions, visual assertions, and the patterns that keep agents running when everything goes sideways.
What you'll learn
- Common failure modes in visual automation and how to detect them
- Retry strategies: exponential backoff, alternative paths, graceful degradation
- Visual assertions: verifying screen state before and after actions
- Building agents that recover from errors without human intervention
Why Visual Agents Break
A text-based API either works or returns an error code. Visual automation is messier. The screen is a living, changing environment. Here are the ways it breaks:
Unexpected popups. Cookie consent banners, newsletter signup modals, chat widgets, browser notifications, system updates. Any of these can appear between your screenshot and your click, intercepting the action.
Layout shifts. Content loading asynchronously pushes elements around. The button was at (500, 300) in the screenshot but by the time the agent clicks, an ad loaded above it and pushed it to (500, 450). The click hits empty space.
Session expiration. The agent is mid-workflow when the website session expires. The next click redirects to a login page instead of the expected page. The entire workflow context is lost.
CAPTCHA challenges. Anti-bot systems detect automated behavior and present CAPTCHAs. The vision agent can see the CAPTCHA but cannot reliably solve it. This is a hard blocker that requires a different strategy.
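The layout-shift problem above comes from acting on a stale screenshot. One possible guard is to wait until the screen stops changing before clicking. The sketch below is illustrative only: `takeScreenshot` is an assumed caller-supplied function that returns a comparable fingerprint of the screen (for example, a hash string), and `sleep` and `waitForStableScreen` are hypothetical names, not part of any particular library.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls the screen until two consecutive captures match, or gives up.
// `takeScreenshot` is an assumed async function returning a comparable
// fingerprint of the screen (for example, a hash string of the image).
async function waitForStableScreen(
  takeScreenshot,
  { intervalMs = 250, maxChecks = 10 } = {}
) {
  let previous = await takeScreenshot();
  for (let check = 0; check < maxChecks; check++) {
    await sleep(intervalMs);
    const current = await takeScreenshot();
    if (current === previous) {
      // Nothing moved between the two captures: safe to act.
      return { stable: true, screenshot: current };
    }
    previous = current;
  }
  // The screen never settled; the caller decides whether to retry or escalate.
  return { stable: false, screenshot: previous };
}
```

Exact equality is probably too strict for real pages with blinking cursors or animations, so a production agent would likely compare with some tolerance. The structure stays the same: capture, wait, capture again, and only act once the two agree.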
The Error Recovery Framework
Every resilient agent follows this framework when something goes wrong:
1. Detect. The agent takes a screenshot after every action and compares the result to what was expected. If the result does not match (the page did not change, an error message appeared, or an unexpected page loaded), the agent knows something went wrong.
2. Diagnose. What kind of error is it? A popup blocking the view (dismiss it). A session timeout (re-login). A page that did not load (wait and retry). A CAPTCHA (escalate to human). The diagnosis determines the recovery strategy.
3. Recover. Execute the appropriate recovery action. Dismiss the popup, re-authenticate, wait for the page, or escalate. Then verify recovery was successful before resuming the original workflow.
4. Resume. Return to the workflow at the point of failure. If the recovery changed the page state (like re-logging in), the agent may need to navigate back to where it was. Track workflow progress so resumption is possible.
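The four steps above can be sketched as a single loop around one workflow step. This is a minimal sketch under stated assumptions: `step`, `observe`, `diagnose`, and `recoverers` are all hypothetical caller-supplied pieces, and the failure labels (`popup`, `sessionExpired`, and so on) are illustrative names rather than a standard taxonomy.

```javascript
// Maps a diagnosed failure type to a recovery strategy name.
// Both the failure labels and the strategy names are illustrative.
const RECOVERY_BY_FAILURE = {
  popup: 'dismissPopup',
  sessionExpired: 'reLogin',
  pageNotLoaded: 'waitAndRetry',
  captcha: 'escalateToHuman',
};

// Runs one workflow step through detect -> diagnose -> recover -> resume.
// `step.act` performs the action, `step.expect` checks the resulting screen,
// `observe` captures the screen, `diagnose` labels the failure, and
// `recoverers` holds one async function per strategy name. All are assumed
// to be supplied by the caller; nothing here calls a real automation API.
async function runStepWithRecovery(
  step,
  { observe, diagnose, recoverers, maxRecoveries = 2 }
) {
  let recoveries = 0;
  while (true) {
    await step.act();
    const screen = await observe(); // 1. Detect
    if (step.expect(screen)) {
      return { ok: true, recoveries };
    }

    const failure = diagnose(screen); // 2. Diagnose
    const strategy = RECOVERY_BY_FAILURE[failure];
    const cannotRecover =
      !strategy || strategy === 'escalateToHuman' || recoveries >= maxRecoveries;
    if (cannotRecover) {
      return { ok: false, escalate: true, failure };
    }

    await recoverers[strategy](screen); // 3. Recover
    recoveries++;
    // 4. Resume: loop back and repeat the same step. The next
    // detect pass doubles as the check that recovery worked.
  }
}
```

Because the loop wraps a single step, the workflow's progress is simply the index of the step being run, which is what makes resuming possible. A recovery that changes the page (such as logging in again) would also need its recoverer to navigate back before returning.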
Retry Strategies
Not all retries are equal. The right strategy depends on the type of failure:
Simple retry. The action failed for a transient reason (network hiccup, slow page load). Wait 1 second and try the exact same action again. Works for: temporary loading delays, flaky network connections.
Exponential backoff. Double the wait after each failed attempt: 1 second, then 2, then 4, then 8, capped at 30 seconds. This prevents hammering a slow or overloaded server. Works for: server errors, rate limiting, slow infrastructure.
Alternative path. The primary approach failed, so try a different route to the same goal. Cannot click the menu item? Try the keyboard shortcut. Cannot find the Settings link? Navigate directly via URL. Works for: UI changes, hidden elements, broken navigation.
Graceful degradation. The full task cannot be completed, but a partial result is still valuable. Cannot fill all 10 fields because one has an unexpected dropdown? Fill the other 9 and flag the problematic field for human attention. Works for: partially broken pages, unsupported UI elements.
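The alternative-path and graceful-degradation strategies can each be sketched in a few lines. Everything here is an assumption made for illustration: `tryAlternatives` and `fillFormWithDegradation` are hypothetical names, each `run` function is assumed to resolve to `{ success: boolean }`, and `fillField` is an assumed caller-supplied function that throws when a field cannot be filled.

```javascript
// Alternative path: tries each route in order and returns the first success.
// Each entry is { name, run }, where `run` is an assumed async function
// resolving to { success: boolean }.
async function tryAlternatives(alternatives) {
  const failed = [];
  for (const { name, run } of alternatives) {
    try {
      const result = await run();
      if (result.success) return { success: true, via: name, failed };
    } catch (err) {
      // A thrown error is treated the same as an unsuccessful attempt.
    }
    failed.push(name);
  }
  return { success: false, via: null, failed };
}

// Graceful degradation: fills every field it can and reports the ones it
// could not, instead of aborting the whole form on the first failure.
// `fillField` is an assumed async function that throws on failure.
async function fillFormWithDegradation(fields, fillField) {
  const filled = [];
  const needsHuman = [];
  for (const field of fields) {
    try {
      await fillField(field);
      filled.push(field.name);
    } catch (err) {
      needsHuman.push({ name: field.name, reason: err.message });
    }
  }
  return { complete: needsHuman.length === 0, filled, needsHuman };
}
```

A call such as `tryAlternatives([{ name: 'menu', run: clickMenu }, { name: 'shortcut', run: pressShortcut }])` (with both functions supplied by you) records which routes failed, which is useful when deciding whether the UI has changed for good. The `needsHuman` list gives a person exactly the fields to look at rather than a failed task.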
// Retry with exponential backoff

// Assumed helper (not shown in the lesson): resolves after `ms` milliseconds
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(action, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const result = await action();
    if (result.success) return result;
    // No point waiting after the final attempt
    if (attempt === maxRetries - 1) break;
    // Exponential backoff: 1s, 2s, 4s, ... capped at 30s
    const waitTime = Math.min(Math.pow(2, attempt) * 1000, 30000);
    console.log(`Attempt ${attempt + 1} failed. Waiting ${waitTime}ms...`);
    await wait(waitTime);
  }
  // All retries exhausted: escalate or degrade gracefully
  return { success: false, error: 'Max retries exceeded' };
}