Browser Agent Architecture
The full stack: Chrome DevTools Protocol, page context, and when to use vision vs. DOM.
Computer use gives you eyes. Browser automation gives you hands that reach inside the page. The best agents combine both. This lesson teaches you how to architect a browser agent that uses the right approach for each situation.
What you'll learn
- Chrome DevTools Protocol (CDP): programmatic browser control
- DOM-based interaction vs. visual interaction: tradeoffs and use cases
- Hybrid architecture: combining screenshots with page context
- Building a full browser agent from scratch
Two Ways to See a Web Page
A web page exists in two forms simultaneously. The visual form is what you see on screen -- pixels, colors, layout, text. The structural form is the DOM (Document Object Model) -- the HTML tree that describes every element, its properties, and its relationships.
Computer use interacts with the visual form: take a screenshot, identify elements, click at coordinates. This is universal -- it works on any interface, web or desktop. But it is slow (3-10 seconds per action) and approximate (coordinates are estimated).
DOM-based automation interacts with the structural form: select an element by CSS selector, read its text, click it programmatically. This is fast (milliseconds), precise (no coordinate guessing), and reliable (selectors target specific elements). But it only works for web pages, and it requires knowledge of the page structure.
A well-architected browser agent uses both. DOM for speed and precision when the page structure is known. Vision for flexibility and universality when it is not.
Chrome DevTools Protocol
Chrome DevTools Protocol (CDP) is the API that Chrome exposes for programmatic control. Tools like Puppeteer and Playwright use CDP under the hood, and Chrome extensions can reach it through the chrome.debugger API. Understanding CDP shows you the full range of capabilities the browser exposes:
Page navigation. Load URLs, go back/forward, handle redirects. Faster than typing URLs via computer use.
DOM access. Query elements with CSS selectors or XPath. Read text content, attributes, computed styles. Modify the DOM directly if needed.
JavaScript execution. Run arbitrary JavaScript in the page context. Fill forms, trigger events, extract data, or manipulate page state.
Network interception. Monitor HTTP requests and responses. Block ads and trackers. Modify request headers. Intercept API calls for data extraction.
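As a sketch of the network-interception capability, Playwright's route API lets you inspect every request before it leaves the browser and abort the ones you don't want. The blocklist, the `shouldBlock`/`filterRequest` names, and the minimal structural types below are illustrative assumptions, not part of the lesson or Playwright's own type names:

```typescript
// Minimal structural types so this sketch stands alone; Playwright's real
// Route and Request objects have these methods with these shapes.
interface InterceptedRequest { url(): string }
interface InterceptedRoute {
  request(): InterceptedRequest;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Hypothetical blocklist -- swap in whatever ad/tracker hosts you care about.
const BLOCKED_HOSTS = new Set(['ads.example.net', 'tracker.example.org']);

// Pure predicate, kept separate so the blocking rule is trivially testable.
export function shouldBlock(url: string): boolean {
  try {
    return BLOCKED_HOSTS.has(new URL(url).hostname);
  } catch {
    return false; // malformed URL: let the browser handle it
  }
}

// Route handler: abort blocked requests, pass everything else through.
export async function filterRequest(route: InterceptedRoute): Promise<void> {
  if (shouldBlock(route.request().url())) {
    await route.abort();
  } else {
    await route.continue();
  }
}

// With a real Playwright page, attach it before navigating:
// await page.route('**/*', filterRequest);
```

Keeping the blocking rule as a pure function means you can test it without launching a browser at all.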
```typescript
// Playwright example: DOM-based interaction
import { chromium } from 'playwright';

// Launch a browser
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

// Navigate (faster than typing URL via computer use)
await page.goto('https://example.com/login');

// Fill form using CSS selectors (precise, no coordinate guessing)
await page.fill('#email', 'user@example.com');  // find by ID
await page.fill('#password', 'secure123');      // find by ID
await page.click('button[type="submit"]');      // find by attribute

// Wait for navigation (built-in, no manual screenshot checking)
await page.waitForURL('**/dashboard');

// Extract data (direct DOM access, no OCR needed)
const userName = await page.textContent('.user-name');
console.log(`Logged in as: ${userName}`);
```

When to Use Vision vs. DOM
The decision tree is straightforward:
Use DOM when: You know the page structure. You have CSS selectors that reliably target elements. The page is a standard web application. Speed matters. You need deterministic, element-level precision rather than estimated coordinates.
Use Vision when: You do not know the page structure in advance. The page uses heavy JavaScript frameworks that make DOM querying unreliable. The interface is not web-based (desktop app, terminal). You need to verify what the user actually sees. The page changes frequently and selectors break.
Use both when: You want speed AND visual verification. Navigate via DOM (fast), then take a screenshot to verify the page looks correct (reliable). Fill forms via DOM, then screenshot to confirm the values display properly. This hybrid approach gives you the best of both worlds.
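One way to structure the "use both" pattern is a small combinator that runs a fast DOM action and then a slower visual check. Everything here is a sketch: `actThenVerify` and the commented-out `checkScreenshot` helper are hypothetical names, not APIs from this lesson or from Playwright:

```typescript
// "Act fast, verify visually": run a cheap DOM action, then a slower
// screenshot-based check, and report both outcomes together.
export async function actThenVerify<T>(
  act: () => Promise<T>,
  verify: (result: T) => Promise<boolean>,
): Promise<{ result: T; verified: boolean }> {
  const result = await act();            // DOM fast lane (milliseconds)
  const verified = await verify(result); // vision check (seconds)
  return { result, verified };
}

// Hypothetical Playwright + vision usage:
// const { verified } = await actThenVerify(
//   () => page.fill('#email', 'user@example.com'),       // DOM fill
//   async () => checkScreenshot(await page.screenshot()), // vision verify
// );
```

Separating the action from the verification also lets you skip the expensive visual check on low-stakes steps and keep it for the ones that matter.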
The Hybrid Architecture
The most powerful browser agents layer three capabilities:
Layer 1: DOM (fast lane). For known pages with stable selectors: navigate, fill forms, click buttons, extract data. Millisecond execution. No screenshots needed.
Layer 2: Vision (fallback). When DOM interactions fail or the page is unknown: take a screenshot, ask Claude what is on screen, get coordinates, act visually. Slower but universal.
Layer 3: Page context (intelligence). Read the full HTML, pass it to Claude alongside a screenshot, and let the AI understand both the visual appearance AND the underlying structure. This gives the richest understanding for complex decisions.
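The Layer 1 → Layer 2 escalation can be sketched as a simple fallback wrapper: attempt the fast DOM path, and if it throws (missing selector, timeout), run the vision path instead. The `withVisionFallback` name and the commented `clickViaScreenshot` helper are hypothetical, introduced only for illustration:

```typescript
// Layered dispatch: try the fast DOM path first; if it throws (missing
// selector, timeout), fall back to the slower-but-universal vision path.
type Action<T> = () => Promise<T>;

export async function withVisionFallback<T>(
  domAction: Action<T>,    // Layer 1: fast lane
  visionAction: Action<T>, // Layer 2: screenshot + coordinates
): Promise<T> {
  try {
    return await domAction();
  } catch {
    return await visionAction();
  }
}

// Hypothetical usage: click by selector, or visually if that fails.
// await withVisionFallback(
//   () => page.click('#submit', { timeout: 2000 }),
//   () => clickViaScreenshot(page, 'the blue Submit button'),
// );
```

A short timeout on the DOM attempt matters here: the fallback is only useful if failure is detected in seconds, not after Playwright's default 30-second wait.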