Computer Use API

What you'll learn

How Claude's computer use tool works under the hood
The coordinate system: how the AI maps pixels to actions
Setting up your first computer-use session with the Anthropic API
The screenshot-action loop in practice

Foundation

The Computer Use Tool

01 Screenshot: Claude sees the screen → 02 Decide: Identifies what to click/type → 03 Act: Sends mouse/keyboard action

The screenshot-action loop.

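Claude's computer use is a special tool -- like web search or code execution, but for interacting with a graphical interface. Claude does not touch the screen directly: it requests actions, and your code carries them out. When you enable the tool, Claude can request four core kinds of action: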

Screenshot. Capture the current state of the screen as an image. This is the AI's vision -- it sees exactly what is displayed, pixel by pixel.

Click. Move the cursor to specific x,y coordinates and click (left, right, double, or middle click). This is how the AI presses buttons, selects options, and interacts with elements.

Type. Send keystrokes to the active element. This is how the AI fills in forms, enters search queries, and writes text into any input field.

Scroll. Scroll up or down on the page. This is how the AI reaches content below the fold, navigates long pages, and reveals hidden elements.
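As a rough illustration of how these four capabilities show up in practice, the sketch below turns a tool input object into a readable summary. The field names (action, coordinate, text, scroll_direction, scroll_amount) are assumed from the computer_20250124 tool version and should be checked against the current API reference. describeAction is a hypothetical helper, not part of the SDK.

```javascript
// Hypothetical helper: summarize a computer-use tool input in plain English.
// Field names are assumed from the computer_20250124 tool version.
function describeAction(input) {
  switch (input.action) {
    case 'screenshot':
      return 'capture the screen';
    case 'left_click':
    case 'right_click':
    case 'middle_click':
    case 'double_click': {
      const [x, y] = input.coordinate;
      return `${input.action} at (${x}, ${y})`;
    }
    case 'type':
      return `type ${JSON.stringify(input.text)}`;
    case 'scroll':
      return `scroll ${input.scroll_direction} by ${input.scroll_amount}`;
    default:
      // Other actions exist; a real handler should deal with them explicitly.
      return `unhandled action: ${input.action}`;
  }
}

console.log(describeAction({ action: 'left_click', coordinate: [500, 325] }));
console.log(describeAction({ action: 'type', text: 'hello' }));
```

A switch with an explicit default keeps unknown actions visible instead of silently ignoring them, which matters once the agent starts requesting actions you have not implemented yet.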

Architecture

The Coordinate System

When the AI sees a screenshot, it needs to know where things are so it can click accurately. The coordinate system is straightforward:

Origin (0,0) is the top-left corner of the screen. X increases going right. Y increases going down. Coordinates are zero-based, so if your screen is 1920x1080 pixels, the bottom-right pixel is (1919, 1079).

Screen resolution matters. A button might be at (500, 300) on a 1920x1080 display but at (250, 150) on a 960x540 display. Always know your resolution and communicate it to the AI so coordinates are accurate.

Center of the element. When clicking a button, aim for the center, not the edge. A 200x50 pixel button whose top-left corner is at (400, 300) should be clicked at approximately (500, 325), the center point: 400 + 200/2 = 500 and 300 + 50/2 = 325. This gives the most reliable hits.
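The arithmetic behind the last two points can be sketched in a few lines. Both centerOf and scalePoint are hypothetical helper names introduced here for illustration. The sketch assumes that positions scale linearly with resolution, which holds when one display is simply a resized version of the other.

```javascript
// Center of a rectangle, given its top-left corner and size in pixels.
function centerOf({ x, y, width, height }) {
  return [Math.round(x + width / 2), Math.round(y + height / 2)];
}

// Rescale a point from one resolution to another, for example when the
// screenshot the model sees is smaller than the real display.
function scalePoint([x, y], from, to) {
  return [
    Math.round(x * (to.width / from.width)),
    Math.round(y * (to.height / from.height)),
  ];
}

const button = { x: 400, y: 300, width: 200, height: 50 };
console.log(centerOf(button)); // [ 500, 325 ]

const fullHd = { width: 1920, height: 1080 };
const halfSize = { width: 960, height: 540 };
console.log(scalePoint([500, 300], fullHd, halfSize)); // [ 250, 150 ]
```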

Your First Computer Use Session

Here is the minimal code to start a computer-use session with Claude. Every line is commented so you understand exactly what is happening:

Code

Setting Up the API Call

The computer use tool is passed as part of the tools array in your API request. Here is the structure:

// Import the Anthropic SDK
import Anthropic from '@anthropic-ai/sdk';

// Create the client (reads the ANTHROPIC_API_KEY environment variable by default)
const client = new Anthropic();

// Send a message with computer use enabled. Computer use is a beta feature,
// so the request goes through client.beta.messages with the matching beta flag.
const response = await client.beta.messages.create({
  model: 'claude-sonnet-4-20250514',   // Model with vision capabilities
  max_tokens: 4096,                    // Room for the AI to think and act
  betas: ['computer-use-2025-01-24'],  // Beta flag for this tool version
  tools: [{
    type: 'computer_20250124',         // The computer use tool type
    name: 'computer',                  // Tool name
    display_width_px: 1920,            // Your screen width in pixels
    display_height_px: 1080,           // Your screen height in pixels
    display_number: 0                  // X11 display number (optional)
  }],
  messages: [{
    role: 'user',
    content: 'Take a screenshot and tell me what you see.'
  }]
});

When Claude responds, it will request a tool use action -- either taking a screenshot first, or if you provide one, immediately suggesting a click/type/scroll action. You then execute that action on the actual screen and send back the result.
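To make the response handling concrete, here is a minimal sketch that separates Claude's text from its tool requests. splitResponse is a hypothetical helper, and the mock response only imitates the general shape of a Messages API reply (a content array of text and tool_use blocks); the id and text values are made up for illustration.

```javascript
// Hypothetical helper: split a response into plain text and tool requests.
function splitResponse(response) {
  const toolUses = response.content.filter((block) => block.type === 'tool_use');
  const text = response.content
    .filter((block) => block.type === 'text')
    .map((block) => block.text)
    .join('\n');
  // With no tool requests left, Claude considers the task finished.
  return { toolUses, text, done: toolUses.length === 0 };
}

// Mock response shaped like a Messages API reply (values are illustrative).
const mockResponse = {
  stop_reason: 'tool_use',
  content: [
    { type: 'text', text: 'I will take a screenshot first.' },
    { type: 'tool_use', id: 'toolu_example', name: 'computer', input: { action: 'screenshot' } },
  ],
};

const { toolUses, text, done } = splitResponse(mockResponse);
console.log(text);
console.log(toolUses[0].input.action, done);
```

Each tool_use block carries an id, which you need later to tie your result back to the request.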

Implementation

The Execution Loop

Computer use is a conversation, not a single call. The pattern looks like this:

Step 1: You send a task to Claude with the computer use tool enabled. Claude responds with a tool_use block requesting a screenshot.

Step 2: You capture a screenshot of the actual screen, encode it as base64, and send it back as a tool_result.

Step 3: Claude analyzes the screenshot and responds with the next action -- click at (x, y), type "hello", or scroll down. You execute that action on the real screen.

Step 4: You take another screenshot showing the result of the action and send it back. Claude decides the next step. The loop continues until the task is complete.

// The execution loop (simplified). captureScreen, clickAt, typeText, and
// scrollBy stand in for your own sandbox automation helpers.
// Assumes at most one tool_use block per response.
const MAX_ACTIONS = 20; // Safety cap on actions per session

for (let step = 0; step < MAX_ACTIONS; step++) {
  // 1. Get Claude's next action
  const response = await client.beta.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 4096,
    betas: ['computer-use-2025-01-24'],
    tools: [computerTool],
    messages: conversationHistory
  });

  // 2. Record Claude's turn so the tool_result can refer back to it
  conversationHistory.push({ role: 'assistant', content: response.content });

  // 3. Extract the tool use; if there is none, the task is finished
  const toolUse = response.content.find(b => b.type === 'tool_use');
  if (!toolUse) break;

  // 4. Execute the requested action on the real screen
  const { action, coordinate, text } = toolUse.input;
  if (action === 'left_click') {
    await clickAt(coordinate[0], coordinate[1]);
  } else if (action === 'type') {
    await typeText(text);
  } else if (action === 'scroll') {
    await scrollBy(toolUse.input.scroll_direction, toolUse.input.scroll_amount);
  }
  // A 'screenshot' action needs no extra work: one is captured below.

  // 5. Send a fresh screenshot back as a tool_result inside a user message
  const screenshot = await captureScreen(); // base64-encoded PNG
  conversationHistory.push({
    role: 'user',
    content: [{
      type: 'tool_result',
      tool_use_id: toolUse.id,
      content: [{
        type: 'image',
        source: { type: 'base64', media_type: 'image/png', data: screenshot }
      }]
    }]
  });
}
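Returning a screenshot to Claude involves a specific message shape: a user-role message containing a tool_result block whose tool_use_id matches the id of Claude's request. The sketch below factors that shape into a helper. buildToolResultMessage is a hypothetical name, the id and image data are placeholders, and the sketch assumes PNG screenshots.

```javascript
// Hypothetical helper: wrap a base64 PNG screenshot in a tool_result message.
function buildToolResultMessage(toolUseId, base64Png) {
  return {
    role: 'user', // tool results are sent back in a user turn
    content: [
      {
        type: 'tool_result',
        tool_use_id: toolUseId, // must match the id of Claude's tool_use block
        content: [
          {
            type: 'image',
            source: { type: 'base64', media_type: 'image/png', data: base64Png },
          },
        ],
      },
    ],
  };
}

// Placeholder id and image data, for illustration only.
const message = buildToolResultMessage('toolu_example', 'iVBORw0KGgo=');
console.log(JSON.stringify(message, null, 2));
```

Keeping this in one function means every branch of the loop sends results in the same shape, so a formatting mistake only has to be fixed once.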

Practice

Understanding the computer use flow.

Environment

Running Computer Use Safely

Computer use gives AI control of your mouse and keyboard. Safety is not optional. Here are the rules:

Use a sandboxed environment. Never run computer use on your primary desktop. Use a virtual machine (VM), a Docker container with a virtual display, or a cloud instance. If the AI clicks something wrong, it affects the sandbox, not your real machine.

Start with observation only. Before letting the agent click or type, have it take screenshots and describe what it sees. Verify that its understanding matches reality. Then enable actions one at a time.

Set action limits. Cap the number of actions per session -- start with 20. An agent in a confused loop can click thousands of times. Action limits prevent runaway behavior.

Log everything. Record every screenshot and every action. This creates an audit trail for debugging and a training dataset for improvement. You will learn how to build GIF-based audit trails in Lesson 9.
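The action limit and the audit log can share one small guard object. The following is a minimal sketch under that assumption; ActionBudget is a hypothetical class, and a real setup would also persist screenshots alongside each log entry.

```javascript
// Hypothetical guard: logs every action and stops once a limit is reached.
class ActionBudget {
  constructor(maxActions = 20) {
    this.maxActions = maxActions;
    this.log = []; // audit trail of every action recorded so far
  }

  // Record an action before executing it; throws once the budget is spent.
  record(action) {
    if (this.log.length >= this.maxActions) {
      throw new Error(`Action limit of ${this.maxActions} reached`);
    }
    this.log.push({ step: this.log.length + 1, at: new Date().toISOString(), action });
  }
}

const budget = new ActionBudget(2);
budget.record({ action: 'screenshot' });
budget.record({ action: 'left_click', coordinate: [500, 325] });
try {
  budget.record({ action: 'type', text: 'hello' });
} catch (err) {
  console.log(err.message); // Action limit of 2 reached
}
```

Calling record before executing each action means a runaway loop is stopped before the extra click happens, not after.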

Anti-Patterns

Common Setup Mistakes

Wrong resolution. Telling the API your screen is 1920x1080 when it is actually 2560x1440. Every click will miss its target by a wide margin. Always measure and report the actual resolution of your virtual display.

No screenshot after action. Clicking a button but not sending a screenshot of the result. The AI has no idea what happened. Always capture and send a screenshot after every action so the AI can verify the result and plan the next step.

Running on the main desktop. Giving the AI control of your actual computer. One wrong click could open your email, send a message, or delete files. Always use a sandbox. This is non-negotiable.
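The wrong-resolution mistake can be caught automatically by reading the size stored in the screenshot itself. The sketch below reads the width and height from a PNG header and compares them with the values declared in the tool definition. pngSize and matchesDeclared are hypothetical helpers, and the check assumes you send PNG screenshots at the display's native size, without downscaling.

```javascript
// Read width and height from a PNG's IHDR chunk (bytes 16-23, big-endian).
function pngSize(buffer) {
  const signature = buffer.subarray(0, 8).toString('hex');
  if (signature !== '89504e470d0a1a0a') throw new Error('Not a PNG file');
  return { width: buffer.readUInt32BE(16), height: buffer.readUInt32BE(20) };
}

// Compare a base64 screenshot with the resolution declared in the tool definition.
function matchesDeclared(base64Png, tool) {
  const { width, height } = pngSize(Buffer.from(base64Png, 'base64'));
  return width === tool.display_width_px && height === tool.display_height_px;
}

// Build a fake 24-byte PNG header for a 2560x1440 image to exercise the check.
const header = Buffer.alloc(24);
Buffer.from('89504e470d0a1a0a', 'hex').copy(header, 0);
header.writeUInt32BE(13, 8); // IHDR chunk length
header.write('IHDR', 12, 'ascii');
header.writeUInt32BE(2560, 16);
header.writeUInt32BE(1440, 20);

const declared = { display_width_px: 1920, display_height_px: 1080 };
console.log(matchesDeclared(header.toString('base64'), declared)); // false
```

Running this check on the first screenshot of a session, and refusing to continue on a mismatch, avoids a whole run of misplaced clicks.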

Try It Yourself

Set up your first computer use environment:

Option 1 (Docker): Use Anthropic's reference container
  docker run -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY -p 5900:5900 -p 6080:6080 -p 8080:8080 -p 8501:8501 -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest

Option 2 (Local VM): Use VirtualBox or UTM with a Linux desktop
  Set the display to 1920x1080 for consistent coordinates

Option 3 (Cloud): Spin up a GCP/AWS instance with a desktop environment
  Use VNC or noVNC for remote access to the virtual display

Once running, send your first screenshot-only request to Claude.
Verify: does the AI correctly describe what is on screen?

Review

Key concepts.

The Four Computer Use Actions

Screenshot (capture the screen), Click (move cursor and click at x,y coordinates), Type (send keystrokes to active element), Scroll (scroll up or down on the page).

The Coordinate System

Origin (0,0) is the top-left corner. X increases rightward, Y increases downward. Always report actual screen resolution to the API for accurate targeting.

The Execution Loop

Send task -> Claude requests screenshot -> You capture and send it -> Claude suggests action -> You execute and send new screenshot -> Repeat until task complete.

Sandbox Rule

NEVER run computer use on your primary desktop. Always use a VM, Docker container, or cloud instance. One wrong click on your real machine can cause real damage.

Screenshot After Every Action

Always capture and send a screenshot after every click, type, or scroll action. Without it, the AI has no way to verify what happened and cannot plan the next step.

Action Limits

Cap the number of actions per session (start with 20). Prevents runaway behavior if the agent enters a confused loop. Increase gradually as you build confidence.

Check Your Understanding

Computer use API quiz.

1. What are the four actions available through Claude computer use?

2. Why must you always send a screenshot after executing an action?

3. Why should you never run computer use on your primary desktop?
