Multi-Step Workflows

One click is a trick. Twenty clicks in sequence across three pages? That is automation.

Real-world tasks are not single actions. They are chains: log in, navigate to settings, update a field, save, confirm, export. This lesson teaches you to build visual workflows that chain reliably across multiple pages and states.

What you'll learn

How to decompose complex tasks into visual action chains
State management across multiple pages and transitions
Handling page loads, redirects, and dynamic content between steps
Building reusable workflow templates for common patterns

Concept

Task Decomposition

Every multi-step workflow starts as a human description: "Book a flight from SFO to JFK for next Friday." But a vision agent cannot execute that as one action. It needs to break it down into atomic steps:

1. Navigate to the airline website.
2. Find the search form.
3. Enter departure city.
4. Enter destination city.
5. Select the date.
6. Click search.
7. Wait for results.
8. Compare options.
9. Select a flight.
10. Fill passenger details.
11. Enter payment.
12. Confirm booking.

Each of these is a screenshot-analyze-act cycle. The key insight is that the agent does not need to plan all 12 steps upfront. It needs to know the goal and execute one step at a time, using the current screenshot to decide the next action. This is how humans do it too -- you do not memorize every click before starting.
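The one-step-at-a-time loop described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `Action`, `run_workflow`, and the three callables (`take_screenshot`, `decide_next_action`, `execute`) are hypothetical names, and the code assumes you supply your own screenshot capture, vision-model call, and input driver.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str         # "click", "type", "scroll", or "done" (hypothetical vocabulary)
    detail: str = ""  # free-form description of the target or the text to type


def run_workflow(
    goal: str,
    take_screenshot: Callable[[], bytes],
    decide_next_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 30,
) -> List[Action]:
    """Run screenshot-analyze-act cycles until the model reports the goal is done."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                          # screenshot
        action = decide_next_action(goal, screenshot, history)  # analyze
        if action.kind == "done":
            return history
        execute(action)                                         # act
        history.append(action)
    # A step cap keeps a confused agent from looping forever.
    raise RuntimeError(f"Goal not reached within {max_steps} steps: {goal!r}")
```

Only the goal and the history of actions taken so far are carried between cycles; the next step is always chosen from the current screenshot rather than from a plan fixed upfront.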

Architecture

The Workflow State Machine

A multi-step workflow is a state machine. Each state represents what the agent sees on screen, and transitions are the actions that move between states:

Login Page ──[enter credentials, click login]──> Dashboard
Dashboard ──[click Settings]──> Settings Page
Settings Page ──[update email, click Save]──> Confirmation Modal
Confirmation Modal ──[click Confirm]──> Settings Page (updated)

Each transition follows four steps:
1. Verify current state (screenshot matches expected page)
2. Execute action (click, type, scroll)
3. Wait for transition (page load, animation)
4. Verify new state (screenshot matches expected result)

State verification is the critical step most automation skips. Before acting, check that you are on the expected page. After acting, verify you arrived at the expected result. This catches navigation errors, unexpected popups, and session timeouts before they cascade.
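As a rough sketch, the state machine and its four-step transition could be expressed like this. All names here are assumptions made for illustration: `Transition`, `StateMismatch`, `run_transition`, and `run_state_machine` are hypothetical, and `identify_state` stands in for whatever screenshot-plus-model call you use to name the page currently on screen.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Transition:
    source: str  # state the agent must be in before acting
    action: str  # what to do, described in plain language
    target: str  # state the agent must reach afterwards


# The example workflow from the diagram above.
SETTINGS_WORKFLOW = [
    Transition("Login Page", "enter credentials, click login", "Dashboard"),
    Transition("Dashboard", "click Settings", "Settings Page"),
    Transition("Settings Page", "update email, click Save", "Confirmation Modal"),
    Transition("Confirmation Modal", "click Confirm", "Settings Page"),  # now updated
]


class StateMismatch(Exception):
    """Raised when the screen does not show the expected state."""


def run_transition(
    t: Transition,
    identify_state: Callable[[], str],
    perform: Callable[[str], None],
    wait_for_transition: Callable[[], None],
) -> None:
    current = identify_state()   # 1. verify current state
    if current != t.source:
        raise StateMismatch(f"expected {t.source!r} before acting, saw {current!r}")
    perform(t.action)            # 2. execute action
    wait_for_transition()        # 3. wait for transition
    arrived = identify_state()   # 4. verify new state
    if arrived != t.target:
        raise StateMismatch(f"expected {t.target!r} after {t.action!r}, saw {arrived!r}")


def run_state_machine(
    transitions: Iterable[Transition],
    identify_state: Callable[[], str],
    perform: Callable[[str], None],
    wait_for_transition: Callable[[], None],
) -> None:
    for t in transitions:
        run_transition(t, identify_state, perform, wait_for_transition)
```

Raising on the first mismatch stops the workflow at the point where it went wrong, so a popup or session timeout surfaces as one clear error instead of a chain of misdirected clicks.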

Implementation

Handling Page Transitions

The gap between clicking a link and the new page loading is where most workflow failures occur. The agent clicks, takes a screenshot immediately, and sees a blank page or loading spinner. It then chooses its next action from a screen that has not finished rendering.

Wait for stability. After clicking a navigation link, wait 1-3 seconds before taking the next screenshot. This gives the page time to load. For slow sites, increase to 5 seconds.

Check for loading indicators. Take a screenshot and ask Claude: "Is this page fully loaded or still loading?" Look for spinners, progress bars, skeleton screens, or "Loading..." text. If the page is still loading, wait and retry.
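The wait-and-retry check could look something like the sketch below. It assumes a hypothetical `is_still_loading` callable that wraps the "fully loaded or still loading?" question to the model and returns a boolean; `wait_until_loaded`, the poll interval, and the timeout are likewise illustrative choices, not values from this lesson.

```python
import time
from typing import Callable


def wait_until_loaded(
    take_screenshot: Callable[[], bytes],
    is_still_loading: Callable[[bytes], bool],
    poll_interval: float = 1.0,
    timeout: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Poll screenshots until the page no longer looks like it is loading."""
    deadline = clock() + timeout
    while True:
        shot = take_screenshot()
        if not is_still_loading(shot):
            return shot  # first screenshot judged fully loaded
        if clock() >= deadline:
            raise TimeoutError(f"page still loading after {timeout} seconds")
        sleep(poll_interval)
```

Passing `sleep` and `clock` as parameters is a design choice that lets the retry logic be tested without real delays; in normal use the defaults apply.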

Verify arrival. After the page loads, confirm you are on the expected page. Look for page titles, headings, or unique elements that identify the page. The agent's check might read: "I expected to see the Settings page. The screenshot shows a heading that says Account Settings. Confirmed."

Handle redirects. Some clicks trigger redirects through multiple URLs before landing on the final page. The agent should not act during redirects -- it should wait for the final destination. Take screenshots every 2 seconds until the page stabilizes.
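One way to detect that a page has stabilized is to compare consecutive screenshots and stop once they match. The sketch below does this with a byte-level hash, which is an assumption of this example rather than the lesson's prescribed method: it will not settle on pages with animations or blinking cursors, where a looser image comparison would be needed. `wait_for_stable_screen` and its parameters are hypothetical names.

```python
import hashlib
import time
from typing import Callable


def wait_for_stable_screen(
    take_screenshot: Callable[[], bytes],
    interval: float = 2.0,   # seconds between screenshots
    stable_count: int = 2,   # identical screenshots in a row required
    max_checks: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Return the first screenshot that repeats `stable_count` times in a row."""
    last_digest = None
    run_length = 0
    for check in range(max_checks):
        if check > 0:
            sleep(interval)
        shot = take_screenshot()
        digest = hashlib.sha256(shot).digest()
        # Extend the run if nothing changed; otherwise start a new run.
        run_length = run_length + 1 if digest == last_digest else 1
        last_digest = digest
        if run_length >= stable_count:
            return shot
    raise TimeoutError(f"screen did not stabilize within {max_checks} screenshots")
```

The agent acts only on the returned screenshot, so intermediate redirect pages are observed but never clicked.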
