Computer Use API

What you'll learn

  • How Claude's computer use tool works under the hood
  • The coordinate system: how the AI maps pixels to actions
  • Setting up your first computer-use session with the Anthropic API
  • The screenshot-action loop in practice

The Computer Use Tool

  1. Screenshot: Claude sees the screen
  2. Decide: Claude identifies what to click or type
  3. Act: Claude sends a mouse or keyboard action
Figure: The screenshot-action loop.

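Claude's computer use is a special tool, like web search or code execution, but for interacting with a graphical interface. Claude never touches the screen directly: it requests an action, and your code executes it and reports the result. When you enable the tool, Claude can request four core kinds of action: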

Screenshot. Capture the current state of the screen as an image. This is the AI's vision -- it sees exactly what is displayed, pixel by pixel.

Click. Move the cursor to specific x,y coordinates and click (left, right, double, or middle click). This is how the AI presses buttons, selects options, and interacts with elements.

Type. Send keystrokes to the active element. This is how the AI fills in forms, enters search queries, and writes text into any input field.

Scroll. Scroll up or down on the page. This is how the AI reaches content below the fold, navigates long pages, and reveals hidden elements.
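
In the API, these capabilities arrive as named actions inside the input of each tool_use block. The sketch below assumes the computer_20250124 tool version, where a left click is named left_click and a scroll request carries scroll_direction and scroll_amount. The ComputerAction type and describeAction helper are illustrative names, not part of the SDK, and the real tool defines more actions (for example key, double_click, and mouse_move) than are modeled here.

```typescript
// Simplified model of the action inputs Claude sends (assumed field names,
// based on the computer_20250124 tool version; check the current docs).
type ComputerAction =
  | { action: 'screenshot' }
  | { action: 'left_click'; coordinate: [number, number] }
  | { action: 'type'; text: string }
  | {
      action: 'scroll';
      coordinate: [number, number];
      scroll_direction: 'up' | 'down' | 'left' | 'right';
      scroll_amount: number;
    };

// Turn an action into a human-readable line, e.g. for an audit log.
function describeAction(a: ComputerAction): string {
  switch (a.action) {
    case 'screenshot':
      return 'capture the screen';
    case 'left_click':
      return `left-click at (${a.coordinate[0]}, ${a.coordinate[1]})`;
    case 'type':
      return `type ${JSON.stringify(a.text)}`;
    case 'scroll':
      return `scroll ${a.scroll_direction} by ${a.scroll_amount} at (${a.coordinate[0]}, ${a.coordinate[1]})`;
  }
}

console.log(describeAction({ action: 'left_click', coordinate: [500, 325] }));
```

Modeling the actions as a discriminated union lets the compiler flag any action your executor forgets to handle.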

The Coordinate System

When the AI sees a screenshot, it needs to know where things are so it can click accurately. The coordinate system is straightforward:

Origin (0,0) is the top-left corner of the screen. X increases going right. Y increases going down. Pixels are counted from zero, so on a 1920x1080 screen the bottom-right pixel is (1919, 1079).

Screen resolution matters. A button might be at (500, 300) on a 1920x1080 display but at (250, 150) on a 960x540 display. Always know your resolution and report it to the API through display_width_px and display_height_px so the coordinates Claude returns match your screen.
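
If you downscale screenshots before sending them, the coordinates Claude returns refer to the smaller image and must be mapped back to the real display. The following is a minimal sketch of that mapping, assuming simple proportional scaling on each axis. Size and scalePoint are illustrative names.

```typescript
interface Size {
  width: number;
  height: number;
}

// Map a point from one resolution to another by scaling each axis.
function scalePoint(
  [x, y]: [number, number],
  from: Size,
  to: Size
): [number, number] {
  return [
    Math.round((x * to.width) / from.width),
    Math.round((y * to.height) / from.height),
  ];
}

// The example from the text: (500, 300) at 1920x1080 maps to (250, 150) at 960x540.
console.log(scalePoint([500, 300], { width: 1920, height: 1080 }, { width: 960, height: 540 }));
```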

Center of the element. When clicking a button, aim for the center, not the edge. A 200x50 pixel button whose top-left corner is at (400, 300) should be clicked at approximately (500, 325), the center point. This gives the most reliable hits.
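
The center-point arithmetic, plus a bounds check you might run before executing a click, can be sketched as follows. This assumes a rectangle described by its top-left corner and size, and zero-based pixel indices (valid x values run from 0 to width - 1). Rect, centerOf, and isOnScreen are illustrative names.

```typescript
interface Rect {
  x: number; // left edge
  y: number; // top edge
  width: number;
  height: number;
}

// Center of a rectangle, rounded to whole pixels.
function centerOf(r: Rect): [number, number] {
  return [Math.round(r.x + r.width / 2), Math.round(r.y + r.height / 2)];
}

// True if the point addresses a real pixel on a width x height display.
function isOnScreen([x, y]: [number, number], width: number, height: number): boolean {
  return x >= 0 && x < width && y >= 0 && y < height;
}

// The button from the text: 200x50 with its top-left corner at (400, 300).
const target = centerOf({ x: 400, y: 300, width: 200, height: 50 });
console.log(target, isOnScreen(target, 1920, 1080));
```

Rejecting out-of-range coordinates before acting is a cheap way to catch a resolution mismatch early.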

Your First Computer Use Session

Here is the minimal code to start a computer-use session with Claude. Every line is commented so you understand exactly what is happening:

Setting Up the API Call

The computer use tool is passed as part of the tools array in your API request. Here is the structure:

    // Import the Anthropic SDK
    import Anthropic from '@anthropic-ai/sdk';

    // Create the client; it reads your API key from the ANTHROPIC_API_KEY environment variable
    const client = new Anthropic();

    // Send a message with computer use enabled.
    // Computer use is a beta feature, so the request goes through client.beta
    // and names the beta flag that matches the tool version.
    const response = await client.beta.messages.create({
      model: 'claude-sonnet-4-20250514',   // Model with vision capabilities
      max_tokens: 4096,                    // Room for the AI to think and act
      tools: [{
        type: 'computer_20250124',         // The computer use tool type
        name: 'computer',                  // Tool name
        display_width_px: 1920,            // Your screen width in pixels
        display_height_px: 1080,           // Your screen height in pixels
        display_number: 0                  // X11 display number (optional)
      }],
      messages: [{
        role: 'user',
        content: 'Take a screenshot and tell me what you see.'
      }],
      betas: ['computer-use-2025-01-24']   // Beta flag for computer_20250124
    });

When Claude responds, it will request a tool use action -- either taking a screenshot first, or if you provide one, immediately suggesting a click/type/scroll action. You then execute that action on the actual screen and send back the result.
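
A response's content is a list of blocks, and the action request is the block whose type is tool_use. The sketch below shows one way to pick it out. ContentBlock is a hand-written subset of the real response types, findComputerToolUse is an illustrative helper, and the sample data is invented for the example.

```typescript
// Minimal shapes for response content blocks (a subset of the real API types).
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; id: string; name: string; input: Record<string, unknown> };

// Return the first request addressed to the 'computer' tool, if any.
function findComputerToolUse(content: ContentBlock[]) {
  for (const block of content) {
    if (block.type === 'tool_use' && block.name === 'computer') return block;
  }
  return undefined;
}

// Hand-written sample response content, for illustration only.
const sample: ContentBlock[] = [
  { type: 'text', text: 'I will take a screenshot first.' },
  { type: 'tool_use', id: 'toolu_example', name: 'computer', input: { action: 'screenshot' } },
];

const request = findComputerToolUse(sample);
console.log(request?.id, request?.input.action);
```

When no tool_use block is present, Claude has answered in plain text and there is nothing to execute.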

The Execution Loop

Computer use is a conversation, not a single call. The pattern looks like this:

Step 1: You send a task to Claude with the computer use tool enabled. Claude responds with a tool_use block requesting a screenshot.

Step 2: You capture a screenshot of the actual screen, encode it as base64, and send it back as a tool_result.

Step 3: Claude analyzes the screenshot and responds with the next action -- click at (x, y), type "hello", or scroll down. You execute that action on the real screen.

Step 4: You take another screenshot showing the result of the action and send it back. Claude decides the next step. The loop continues until the task is complete.
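
The tool_result from Steps 2 and 4 is sent as a user message that references the id of the tool_use block it answers. The following sketch builds that message, assuming PNG screenshots already encoded as base64. toolResultMessage is an illustrative helper, and the optional error argument shows one way to report a failed action through the is_error field.

```typescript
// Build the user-turn message that answers one tool_use request with a screenshot.
// If the action failed, pass an error message instead of pretending it worked.
function toolResultMessage(toolUseId: string, screenshotBase64: string, error?: string) {
  return {
    role: 'user' as const,
    content: [
      {
        type: 'tool_result' as const,
        tool_use_id: toolUseId, // must match the id of Claude's tool_use block
        is_error: error !== undefined,
        content: [
          ...(error !== undefined ? [{ type: 'text' as const, text: error }] : []),
          {
            type: 'image' as const,
            source: {
              type: 'base64' as const,
              media_type: 'image/png' as const,
              data: screenshotBase64,
            },
          },
        ],
      },
    ],
  };
}

// 'iVBORw0KGgo=' is a placeholder, not a real screenshot.
console.log(JSON.stringify(toolResultMessage('toolu_example', 'iVBORw0KGgo='), null, 2));
```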

    // The execution loop (simplified). captureScreen, clickAt, and typeText are
    // helpers you implement for your sandbox; captureScreen returns a base64 PNG.
    // computerTool is the tool definition from the previous example.
    const MAX_ACTIONS = 20; // Action limit, so a confused agent cannot loop forever

    for (let step = 0; step < MAX_ACTIONS; step++) {
      // 1. Get Claude's next action
      const response = await client.beta.messages.create({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 4096,
        tools: [computerTool],
        messages: conversationHistory,
        betas: ['computer-use-2025-01-24']
      });

      // Keep Claude's turn in the history so each tool_result has a matching tool_use
      conversationHistory.push({ role: 'assistant', content: response.content });

      // 2. Extract the tool use from the response; no tool use means Claude is done
      const toolUse = response.content.find(b => b.type === 'tool_use');
      if (!toolUse) break;

      // 3. Execute the requested action on the real screen
      const { action, coordinate, text } = toolUse.input;
      if (action === 'left_click') {
        await clickAt(coordinate[0], coordinate[1]);
      } else if (action === 'type') {
        await typeText(text);
      }
      // 'screenshot' needs no action; scroll, key, and the other actions are omitted here

      // 4. Send back a fresh screenshot as the tool_result for this tool_use
      const screenshot = await captureScreen();
      conversationHistory.push({
        role: 'user',
        content: [{
          type: 'tool_result',
          tool_use_id: toolUse.id,
          content: [{
            type: 'image',
            source: { type: 'base64', media_type: 'image/png', data: screenshot }
          }]
        }]
      });
    }

Running Computer Use Safely

Computer use gives AI control of your mouse and keyboard. Safety is not optional. Here are the rules:

Use a sandboxed environment. Never run computer use on your primary desktop. Use a virtual machine (VM), a Docker container with a virtual display, or a cloud instance. If the AI clicks something wrong, it affects the sandbox, not your real machine.

Start with observation only. Before letting the agent click or type, have it take screenshots and describe what it sees. Verify that its understanding matches reality. Then enable actions one at a time.

Set action limits. Cap the number of actions per session -- start with 20. An agent in a confused loop can click thousands of times. Action limits prevent runaway behavior.

Log everything. Record every screenshot and every action. This creates an audit trail for debugging and a training dataset for improvement. You will learn how to build GIF-based audit trails in Lesson 9.
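
The observation-only, action-limit, and logging rules above can be combined into one small gate that every requested action passes through before it runs. This is a minimal in-memory sketch. SessionGuard and LogEntry are illustrative names, and a real audit trail would also store the screenshots and persist entries to disk.

```typescript
interface LogEntry {
  step: number;
  action: string;
  allowed: boolean;
  at: string; // ISO timestamp
}

class SessionGuard {
  readonly log: LogEntry[] = [];
  private executed = 0;
  private readonly maxActions: number;
  private readonly allowedActions: ReadonlySet<string>;

  constructor(maxActions: number, allowedActions: ReadonlySet<string>) {
    this.maxActions = maxActions;
    this.allowedActions = allowedActions;
  }

  // Returns true if the action may run; every decision is logged either way.
  permit(action: string): boolean {
    const allowed = this.executed < this.maxActions && this.allowedActions.has(action);
    if (allowed) this.executed += 1;
    this.log.push({ step: this.log.length + 1, action, allowed, at: new Date().toISOString() });
    return allowed;
  }
}

// Observation-only session: screenshots are allowed, clicks are refused.
const guard = new SessionGuard(20, new Set(['screenshot']));
console.log(guard.permit('screenshot'), guard.permit('left_click'));
console.log(guard.log.map(e => `${e.step}:${e.action}:${e.allowed}`));
```

Widening the allowed set one action at a time matches the advice to enable actions gradually.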

Common Setup Mistakes

Wrong resolution. Telling the API your screen is 1920x1080 when it is actually 2560x1440. Every click will miss its target by a wide margin. Always measure and report the actual resolution of your virtual display.

No screenshot after action. Clicking a button but not sending a screenshot of the result. The AI has no idea what happened. Always capture and send a screenshot after every action so the AI can verify the result and plan the next step.

Running on the main desktop. Giving the AI control of your actual computer. One wrong click could open your email, send a message, or delete files. Always use a sandbox. This is non-negotiable.
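
The first mistake above can be caught with a startup check that compares the size declared in the tool definition with the size you actually measure, for example from the pixel dimensions of a captured screenshot. This is a sketch only. Resolution and resolutionMismatch are illustrative names, and how you measure the display depends on your sandbox.

```typescript
interface Resolution {
  width: number;
  height: number;
}

// Compare the size declared in the tool definition with the measured display size.
// Returns null when they agree, or a message describing the mismatch.
function resolutionMismatch(declared: Resolution, actual: Resolution): string | null {
  if (declared.width === actual.width && declared.height === actual.height) return null;
  return `declared ${declared.width}x${declared.height} but display is ${actual.width}x${actual.height}`;
}

// The scenario from the text: the API is told 1920x1080, the display is 2560x1440.
console.log(resolutionMismatch({ width: 1920, height: 1080 }, { width: 2560, height: 1440 }));
```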

Try It Yourself

Set up your first computer use environment:

Option 1 (Docker): Use Anthropic's reference container. Check the anthropic-quickstarts repository for the current image tag and any required environment variables.

    docker run -p 5900:5900 ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest

Option 2 (Local VM): Use VirtualBox or UTM with a Linux desktop. Set the display to 1920x1080 for consistent coordinates.

Option 3 (Cloud): Spin up a GCP/AWS instance with a desktop environment. Use VNC or noVNC for remote access to the virtual display.

Once running, send your first screenshot-only request to Claude. Verify: does the AI correctly describe what is on screen?

Key concepts.

The Four Computer Use Actions
Screenshot (capture the screen), Click (move cursor and click at x,y coordinates), Type (send keystrokes to active element), Scroll (scroll up or down on the page).
The Coordinate System
Origin (0,0) is the top-left corner. X increases rightward, Y increases downward. Always report actual screen resolution to the API for accurate targeting.
The Execution Loop
Send task -> Claude requests screenshot -> You capture and send it -> Claude suggests action -> You execute and send new screenshot -> Repeat until task complete.
Sandbox Rule
NEVER run computer use on your primary desktop. Always use a VM, Docker container, or cloud instance. One wrong click on your real machine can cause real damage.
Screenshot After Every Action
Always capture and send a screenshot after every click, type, or scroll action. Without it, the AI has no way to verify what happened and cannot plan the next step.
Action Limits
Cap the number of actions per session (start with 20). Prevents runaway behavior if the agent enters a confused loop. Increase gradually as you build confidence.

Computer use API quiz.

1. What are the four actions available through Claude's computer use?

2. Why must you always send a screenshot after executing an action?

3. Why should you never run computer use on your primary desktop?