How to Use the Claude API in Python

Build with the Claude API using Python. Messages, streaming, tool use, and vision with production code examples.


The Claude API is the foundation of everything you build with Claude beyond the chat interface. Every autonomous agent, every MCP server, every production AI workflow starts here — with the Messages API and a few lines of Python.

We build everything at Like One on the Claude API. Our Telegram bot, our brain infrastructure, our content pipeline, our accessibility tools. This is not a documentation mirror. This is how we actually use the API in production, with the patterns and pitfalls we have learned across thousands of API calls.

If you are deciding between Claude and other models, read our comparison guide first. If you already know Claude is right and want to build, keep reading.

Setup and Authentication

Install the official Anthropic Python SDK. It handles authentication headers, automatic retries (including on rate-limit responses), and typed request and response objects out of the box.

pip install anthropic

Get your API key from the Anthropic Console. Store it as an environment variable — never hardcode API keys in source code.

export ANTHROPIC_API_KEY="sk-ant-..."

The SDK reads the environment variable automatically. Your first API call takes only a few lines:

import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain what an API is in one sentence."}
    ]
)
print(message.content[0].text)

That is a complete, working program. The SDK handles the HTTP request, authentication header, response parsing, and error handling. If the API key is missing, you get a clear error. If the request fails transiently, the SDK retries automatically with exponential backoff.

Understanding the Messages API

The Messages API is Claude's primary interface. Every interaction — single turn, multi-turn conversation, tool use, vision — goes through the same endpoint. The structure is consistent:

response = client.messages.create(
    model="claude-sonnet-4-6",      # Which model to use
    max_tokens=1024,                 # Maximum response length
    system="You are a helpful assistant.",  # Optional system prompt
    messages=[                       # Conversation history
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "What can you do?"}
    ]
)

Key parameters you should know:

  • model: The model ID. Current options include claude-opus-4-6 (most capable), claude-sonnet-4-6 (best balance), and claude-haiku-4-5 (fastest and cheapest). See our model comparison for choosing the right one.
  • max_tokens: Required. Sets the upper bound on response length. Claude stops generating when it hits this limit or finishes naturally. Set it higher than you need — you only pay for tokens actually generated.
  • system: The system prompt. Sets Claude's behavior, persona, and constraints. This is where you define what your application does. Unlike messages, the system prompt is not part of the conversation — it frames everything that follows.
  • temperature: Controls randomness on a scale from 0.0 to 1.0. Default is 1.0. Use lower values (0.0-0.3) for more predictable output such as code generation, and higher values (0.7-1.0) for creative writing. Even at 0.0, identical output across calls is not guaranteed. Most production applications work best between 0.0 and 0.5.
  • messages: The conversation history. Alternating user and assistant messages. Claude sees the full history on every request — there is no server-side session state. You manage the conversation.

The Response Object

The API returns a structured response with everything you need:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about APIs"}]
)

print(response.id)              # Unique message ID
print(response.model)           # Model that responded
print(response.content[0].text) # The actual response text
print(response.stop_reason)     # Why generation stopped
print(response.usage)           # Token counts

The usage field is critical for cost management. It shows input_tokens and output_tokens separately, so you can track exactly where your API spend goes. In our experience, most applications spend 70-80% on input tokens (system prompts and conversation history) and 20-30% on output.
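As a rough sketch of how the usage counts turn into a cost figure, the helper below multiplies each count by a per-million-token price. The function name and the prices in the example are placeholders for illustration, not Anthropic's actual rates; substitute the numbers from the current pricing page.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_mtok, output_price_per_mtok):
    """Return the estimated cost in dollars for one request.

    Prices are expressed in dollars per million tokens.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Hypothetical prices, for illustration only.
cost = estimate_cost(
    input_tokens=2_000,
    output_tokens=500,
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
)
print(f"${cost:.4f}")
```

In a real application you would pass response.usage.input_tokens and response.usage.output_tokens instead of literal numbers.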

The stop_reason tells you why Claude stopped: end_turn means it finished naturally, max_tokens means it hit your limit (you probably need a higher value), stop_sequence means it produced one of your custom stop sequences, and tool_use means Claude wants to call a tool.

Streaming Responses

For user-facing applications, streaming is not optional. Nobody wants to wait 10 seconds staring at a blank screen. Streaming sends tokens as they are generated, so users see the response forming in real time.

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain streaming APIs"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

The streaming interface uses a context manager that handles connection lifecycle automatically. The text_stream iterator yields text chunks as they arrive. Each chunk is typically a few tokens — sometimes a single word, sometimes a phrase.

For more control over the stream, use the raw event stream:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
) as stream:
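    for event in stream:
        # Only text deltas carry a .text attribute; tool-input and
        # thinking deltas use other fields, so check the delta type first.
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)
        elif event.type == "message_stop":
            print("\n[Done]")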

Events include message_start, content_block_start, content_block_delta (the actual text), content_block_stop, message_delta (the final stop reason and usage), and message_stop. For simple text display, text_stream is sufficient. For building sophisticated UIs with typing indicators, progress bars, or tool call animations, the raw events give you full control.

Multi-Turn Conversations

The Messages API is stateless. Every API call is independent. To have a conversation, you send the entire history with each request. This gives you complete control over what Claude remembers and forgets.

conversation = []

def chat(user_message):
    conversation.append({"role": "user", "content": user_message})
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="You are a Python tutor. Be concise.",
        messages=conversation
    )
    
    assistant_message = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_message})
    
    return assistant_message

print(chat("What is a list comprehension?"))
print(chat("Show me a complex example."))
print(chat("Now explain that example line by line."))

Each call sends the full conversation, so Claude has context from all previous turns. The cost grows with conversation length because you resend all previous messages as input tokens. For long conversations, implement a sliding window or summarization strategy to keep costs manageable.

A practical pattern we use in production: keep the last 10 messages in full and summarize older messages into a condensed context block that gets prepended to the conversation. This preserves important context while capping token costs. For a deeper dive into how we handle persistent context, see our persistent memory guide.
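A minimal sketch of that windowing step, under a few assumptions: trim_history is a hypothetical helper name, every message's content is a plain string, and the summary text has already been produced elsewhere (for example by a separate summarization call, which is not shown). The summary is folded into the first kept user message so the history still opens with a user turn.

```python
def trim_history(messages, summary=None, keep_last=10):
    """Return the newest messages, with an optional summary of the rest.

    Assumes every message's "content" is a plain string.
    """
    if len(messages) <= keep_last:
        return list(messages)

    recent = list(messages[-keep_last:])
    # Keep the window starting on a user turn.
    while recent and recent[0]["role"] != "user":
        recent.pop(0)

    if summary and recent:
        first = recent[0]
        # Build a new dict so the caller's original message is not modified.
        recent[0] = {
            "role": "user",
            "content": (
                f"Summary of the earlier conversation: {summary}\n\n"
                f"{first['content']}"
            ),
        }
    return recent
```

You would call this on the stored conversation just before each request and pass the result as messages, keeping the full history in your own storage.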

Tool Use (Function Calling)

Tool use is where the Claude API transforms from a text generator into an agent runtime. You define tools — functions Claude can call — and Claude decides when to use them based on the conversation. This is the mechanism that powers agentic loops.

import json

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city name, e.g. San Francisco"
                }
            },
            "required": ["city"]
        }
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What is the weather in Tokyo?"}]
)

# Claude responds with a tool_use content block
for block in response.content:
    if block.type == "tool_use":
        print(f"Tool: {block.name}")
        print(f"Input: {block.input}")
        # Prints "Tool: get_weather", then a line such as "Input: {'city': 'Tokyo'}"

When Claude decides to use a tool, the response contains a tool_use content block instead of (or alongside) text. Your application executes the function with the provided inputs and sends the result back:

# After getting the tool_use response, execute the tool and send results back.
# The tool_use block is not always first: Claude may emit a text block before it.
tool_use = next(block for block in response.content if block.type == "tool_use")

messages = [
    {"role": "user", "content": "What is the weather in Tokyo?"},
    {"role": "assistant", "content": response.content},
    {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": '{"temperature": 22, "condition": "partly cloudy"}'
            }
        ]
    }
]

final_response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=messages
)

print(final_response.content[0].text)
# "The weather in Tokyo is currently 22 degrees C and partly cloudy."

The tool use loop is the core of agentic AI: Claude reasons about the task, decides which tool to use, you execute it, Claude reasons about the result, and decides whether to use another tool or respond to the user. This loop can run for multiple iterations, with Claude chaining tools together to accomplish complex tasks.

Building a Complete Tool Loop

In production, you need a loop that handles multiple tool calls:

def run_agent(user_message, tools, tool_handlers, max_turns=10):
    messages = [{"role": "user", "content": user_message}]

    # Cap the number of round trips so the loop cannot run forever
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        # Any stop reason other than tool_use (end_turn, max_tokens, ...)
        # means there is nothing to execute, so return the text
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")

        # Process tool calls
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []

        for block in response.content:
            if block.type == "tool_use":
                handler = tool_handlers[block.name]
                result = handler(**block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)
                })

        messages.append({"role": "user", "content": tool_results})

    raise RuntimeError(f"Agent did not finish within {max_turns} turns")

This is the skeleton of every agent. The Claude Agent SDK builds on this pattern with guardrails, observability, and multi-agent orchestration — but the core loop is exactly this.

Vision (Image Understanding)

Claude can process images alongside text. Send images as base64-encoded data or as URLs in your messages:

import base64
from pathlib import Path

# From a local file
image_data = base64.standard_b64encode(
    Path("screenshot.png").read_bytes()
).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "What accessibility issues do you see in this UI?"
                }
            ]
        }
    ]
)

Vision works with PNG, JPEG, GIF, and WebP. Claude handles screenshots, diagrams, charts, documents, and photographs. We use it extensively in our accessibility audit tool to analyze UI screenshots for visual compliance issues that code-only scanners miss.

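Image tokens are calculated based on image dimensions. Anthropic's documentation gives the approximation tokens ≈ (width × height) / 750, so a 1000x1000 image costs roughly 1,300 tokens. Resize images before sending them if you are processing many: halving both dimensions cuts the pixel count, and with it the token estimate, by a factor of four, and the quality difference between a 2000px and a 1000px screenshot is rarely worth that cost.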
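To make the sizing arithmetic concrete, here is a small sketch. It assumes the approximation from Anthropic's vision documentation as we understand it, tokens ≈ (width × height) / 750; both function names are hypothetical, and the actual resizing (for example with an imaging library) is left out.

```python
def estimate_image_tokens(width, height):
    """Approximate token cost, assuming tokens ~ (width * height) / 750."""
    return round(width * height / 750)


def fit_within(width, height, max_edge):
    """Scale dimensions down so the longer edge is at most max_edge.

    The aspect ratio is preserved; images that already fit are unchanged.
    """
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


original = (2000, 1200)
resized = fit_within(*original, max_edge=1000)
print(resized, estimate_image_tokens(*resized))
```

Computing the target size first lets you decide whether a resize is worth doing before touching the image data.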

System Prompts That Work

The system prompt is the most underused feature of the API. Most developers write a single sentence and move on. A well-crafted system prompt is the difference between a generic chatbot and a specialized tool.

Patterns that produce the best results:

system = """You are a senior Python code reviewer for a fintech company.

Rules:
- Flag any security vulnerability immediately, before other feedback
- Check for SQL injection, XSS, command injection in every review
- Suggest type annotations where missing
- Keep feedback actionable — every comment should include a fix

Context:
- The codebase uses FastAPI, SQLAlchemy, and Pydantic
- Database is PostgreSQL 16
- All endpoints require JWT authentication

Format:
- Start with a severity rating: CRITICAL / WARNING / INFO
- Group feedback by file
- End with a summary of the most important change"""

Structure your system prompt in four sections: identity (who Claude is), rules (what it must and must not do), context (what it needs to know about the environment), and format (how the output should be laid out). This pattern works because it mirrors how custom instructions work in Claude Code. The same principles apply at the API level.
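If you maintain several prompts, it can help to assemble them from parts so every prompt follows the same layout. The sketch below mirrors the layout of the example prompt above; build_system_prompt is a hypothetical helper, not part of the SDK.

```python
def build_system_prompt(identity, rules, context, output_format):
    """Assemble a system prompt from an identity line and three bullet sections."""

    def section(title, items):
        bullets = "\n".join(f"- {item}" for item in items)
        return f"{title}:\n{bullets}"

    parts = [
        identity,
        section("Rules", rules),
        section("Context", context),
        section("Format", output_format),
    ]
    return "\n\n".join(parts)


system = build_system_prompt(
    identity="You are a senior Python code reviewer for a fintech company.",
    rules=["Flag any security vulnerability immediately, before other feedback"],
    context=["The codebase uses FastAPI, SQLAlchemy, and Pydantic"],
    output_format=["Start with a severity rating: CRITICAL / WARNING / INFO"],
)
print(system)
```

The resulting string is passed as the system parameter exactly like a hand-written prompt.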

Error Handling in Production

The SDK raises typed exceptions that map directly to API error conditions:

from anthropic import (
    APIConnectionError,
    APIStatusError,
    APITimeoutError,
    AuthenticationError,
    BadRequestError,
    RateLimitError,
)

# Example prompt so the snippet runs on its own
prompt = "Summarize the plot of Hamlet in two sentences."

try:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
except AuthenticationError:
    # Invalid API key: check your ANTHROPIC_API_KEY
    raise
except RateLimitError as e:
    # Too many requests. The SDK retries automatically,
    # but if you still hit this, you need to queue requests
    print(f"Rate limited. Retry after: {e.response.headers.get('retry-after')}")
except APITimeoutError:
    # Request took too long. Consider shorter prompts
    # or increasing the client timeout
    print("Request timed out")
except BadRequestError as e:
    # Invalid request (too many tokens, bad format, etc.)
    print(f"Bad request: {e.message}")
except APIStatusError as e:
    # Catch-all for other HTTP error responses (500s, etc.)
    print(f"API error: {e.status_code}: {e.message}")
except APIConnectionError as e:
    # No response at all (network failure), so there is no status code
    print(f"Connection error: {e}")

The SDK retries connection errors and 408, 409, 429, and 5xx responses automatically with exponential backoff, twice by default. You do not need to implement retry logic unless you want custom behavior. The default request timeout is 10 minutes, so for production applications set a shorter timeout on the client to prevent hung requests:

client = anthropic.Anthropic(
    timeout=60.0,  # 60 second timeout
    max_retries=3  # Retry up to 3 times on transient errors
)

Async Support

For web applications, use the async client. It uses the same interface but runs non-blocking:

import anthropic
import asyncio

async def main():
    client = anthropic.AsyncAnthropic()
    
    message = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(message.content[0].text)

asyncio.run(main())

Async streaming works the same way with async for:

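async def stream_reply():
    client = anthropic.AsyncAnthropic()

    # async with must run inside a coroutine
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello"}]
    ) as stream:
        async for text in stream.text_stream:
            print(text, end="", flush=True)

asyncio.run(stream_reply())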

Use async when your application handles multiple concurrent requests — web servers, Telegram bots, batch processors. Use sync when you are writing scripts, CLI tools, or single-threaded applications. The API behavior is identical; only the execution model differs.
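When you fan out many requests at once, it is usually worth capping how many are in flight so you do not trip rate limits. The sketch below is one way to do that with a semaphore; gather_bounded and make_call are hypothetical names, and make_call stands for any coroutine function that sends one request, so the helper itself does not depend on the SDK.

```python
import asyncio


async def gather_bounded(make_call, prompts, limit=5):
    """Run make_call(prompt) for every prompt with at most `limit` in flight.

    Results are returned in the same order as the prompts.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(prompt):
        async with semaphore:
            return await make_call(prompt)

    return await asyncio.gather(*(run_one(p) for p in prompts))


# Example wiring with the async client (assumed, not executed here):
#
#   client = anthropic.AsyncAnthropic()
#
#   async def ask(prompt):
#       message = await client.messages.create(
#           model="claude-sonnet-4-6",
#           max_tokens=1024,
#           messages=[{"role": "user", "content": prompt}],
#       )
#       return message.content[0].text
#
#   answers = asyncio.run(gather_bounded(ask, ["Hi", "Hello"], limit=2))
```

If one call raises, asyncio.gather propagates that exception to the caller; pass return_exceptions=True to gather if you would rather collect failures alongside results.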

Cost Optimization

API costs are based on tokens in and tokens out. Here is how to optimize without sacrificing quality:

  • Choose the right model. Haiku 4.5 costs several times less per token than Opus 4.6; check Anthropic's pricing page for the current rates. For classification, extraction, and simple generation tasks, Haiku is often sufficient. Reserve Opus for complex reasoning and Sonnet for the middle ground. Our model comparison guide has detailed benchmarks.
  • Trim conversation history. Every message you send gets re-processed as input tokens. Implement a sliding window or summary-based compression for long conversations.
  • Cache with prompt caching. If you send the same system prompt with every request (you should), Anthropic's prompt caching reduces the cost of those repeated tokens by up to 90%. Enable it by adding cache_control markers to your system prompt blocks.
  • Resize images. Token cost scales with pixel count, and a 4K screenshot has four times the pixels of a 1080p one with marginal quality improvement for most tasks.
  • Set appropriate max_tokens. You pay for actual output, not the limit, so a tighter limit does not lower the price of a normal response. What it does is cap the worst case: a runaway response stops at the limit instead of spending tokens on unnecessary verbosity.

Prompt Caching

Prompt caching is the single biggest cost optimization for production applications. If your system prompt is 2,000 tokens and you make 1,000 API calls per day, you are paying for 2 million input tokens that are identical every time. Prompt caching stores the processed system prompt on Anthropic's servers and reuses it across requests.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your long system prompt here...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Hello"}]
)

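The cache_control marker tells Anthropic to cache everything up to that point. The first request pays a premium to write the cache (25% over the base input price for the default 5-minute cache); cached tokens then cost 90% less on subsequent requests. The cache has a 5-minute TTL that resets on each use, so active applications keep the cache warm automatically. Prompts shorter than a model-specific minimum length are not cached.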

You can also cache the first few turns of a conversation. If every conversation starts with the same few-shot examples, mark them for caching and save on every request after the first.
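To see what that means for the 2,000-token prompt and 1,000 daily calls mentioned earlier, here is the arithmetic as a small sketch. It assumes the multipliers as we understand the published pricing (cache writes at 1.25x the base input price, cache reads at 0.10x) and the best case in which the cache is written once and stays warm all day; cached_prompt_cost is a hypothetical helper.

```python
def cached_prompt_cost(prompt_tokens, calls, cache_writes=1,
                       write_multiplier=1.25, read_multiplier=0.10):
    """Return the prompt's cost in base-price token equivalents.

    cache_writes is how many calls had to (re)write the cache; every
    other call is assumed to read it.
    """
    reads = calls - cache_writes
    written = cache_writes * prompt_tokens * write_multiplier
    read = reads * prompt_tokens * read_multiplier
    return written + read


uncached = 2_000 * 1_000               # every call pays full price
cached = cached_prompt_cost(2_000, 1_000)
savings = 1 - cached / uncached
print(f"uncached: {uncached:,} cached: {cached:,.0f} savings: {savings:.1%}")
```

Under these assumptions the system prompt costs the equivalent of about 202,300 tokens instead of 2,000,000, a saving of just under 90%. Traffic with gaps longer than the TTL causes more cache writes and a smaller saving, which you can model by raising cache_writes.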

Batches API for Bulk Processing

When you need to process hundreds or thousands of requests and do not need real-time responses, the Batches API cuts costs by 50%. You submit a batch of requests, and Anthropic processes them within 24 hours at a discounted rate.

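# Example input: the texts you want processed
documents = ["First document text...", "Second document text..."]

# Batches live under client.messages.batches in the Python SDK
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": text}]
            }
        }
        for i, text in enumerate(documents)
    ]
)

# Check status later
status = client.messages.batches.retrieve(batch.id)
if status.processing_status == "ended":
    # Results are not guaranteed to be in request order; match on custom_id
    for entry in client.messages.batches.results(batch.id):
        if entry.result.type == "succeeded":
            print(entry.custom_id, entry.result.message.content[0].text)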

Best for: document processing, data extraction, content generation, classification tasks — anything where latency is not critical but volume is high.

Extended Thinking

For complex reasoning tasks — math, logic, multi-step analysis, code architecture — extended thinking lets Claude reason through the problem step by step before responding. This produces significantly better results on hard problems at the cost of more output tokens.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # How much thinking to allow
    },
    messages=[{"role": "user", "content": "Design a rate limiter..."}]
)

for block in response.content:
    if block.type == "thinking":
        print(f"Thinking: {block.thinking}")
    elif block.type == "text":
        print(f"Response: {block.text}")

The thinking content shows Claude's internal reasoning process. You can display it to users for transparency or log it for debugging. The budget_tokens parameter controls how much thinking Claude does: higher budgets produce better results on harder problems but cost more. Thinking tokens count toward the output limit, so budget_tokens must be smaller than max_tokens.

Production Architecture Patterns

After building multiple production systems on the Claude API, these patterns have proven reliable:

Separate your client from your logic. Create a thin wrapper around the Anthropic client that handles your application's defaults — model selection, system prompt, timeout, retry policy. Your business logic calls the wrapper, not the SDK directly. This makes model upgrades and configuration changes a single-line fix.
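A minimal sketch of such a wrapper, with the SDK client injected so the class stays easy to test. ClaudeService and its ask method are hypothetical names chosen for illustration, not part of the SDK.

```python
class ClaudeService:
    """Thin wrapper holding application defaults for Messages API calls."""

    def __init__(self, client, model, system, max_tokens=1024):
        self._client = client
        self.model = model
        self.system = system
        self.max_tokens = max_tokens

    def ask(self, prompt, **overrides):
        """Send a single user prompt and return the concatenated text blocks."""
        params = {
            "model": self.model,
            "max_tokens": self.max_tokens,
            "system": self.system,
            "messages": [{"role": "user", "content": prompt}],
        }
        # Per-call overrides (e.g. max_tokens=...) win over the defaults.
        params.update(overrides)
        response = self._client.messages.create(**params)
        return "".join(b.text for b in response.content if b.type == "text")


# Example wiring (assumed, not executed here):
#
#   service = ClaudeService(
#       anthropic.Anthropic(timeout=60.0, max_retries=3),
#       model="claude-sonnet-4-6",
#       system="You are a Python tutor. Be concise.",
#   )
#   print(service.ask("What is a list comprehension?"))
```

Because the model ID lives in one place, switching models means changing one constructor argument rather than every call site.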

Log every request and response. Token counts, latency, stop reasons, and error rates. You cannot optimize what you do not measure. We log to structured JSON and aggregate with simple Python scripts — no observability platform needed until you hit serious scale.
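One possible shape for such a log line, as a sketch: log_call is a hypothetical helper that reads the model, stop_reason, and usage fields described earlier and writes a single JSON object per call. The field names in the record are our own choice.

```python
import json
import time


def log_call(response, started_at, logger=print):
    """Emit one JSON line describing a completed Messages API call.

    started_at is a time.monotonic() reading taken just before the request.
    """
    record = {
        "model": response.model,
        "stop_reason": response.stop_reason,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_s": round(time.monotonic() - started_at, 3),
    }
    line = json.dumps(record)
    logger(line)
    return line


# Example usage (assumed, not executed here):
#
#   started = time.monotonic()
#   response = client.messages.create(...)
#   log_call(response, started)
```

One JSON object per line keeps the log easy to aggregate later with a few lines of Python.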

Implement circuit breakers. If the API returns errors on 5 consecutive requests, stop sending requests for 30 seconds. This prevents cascade failures where your application hammers a degraded API and makes everything worse.
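A compact sketch of that rule, using the thresholds from the paragraph above as defaults. CircuitBreaker is a hypothetical class; a production version would likely also need locking if it is shared across threads. The clock is injectable so the behavior can be tested without waiting.

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow calls again
    once `cooldown` seconds have passed."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        """Return True if a request may be sent right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self):
        """Reset the consecutive-failure count after a successful call."""
        self._failures = 0

    def record_failure(self):
        """Count a failure and open the breaker once the threshold is reached."""
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()
```

Call allow() before each request, then record_success() or record_failure() depending on the outcome. When allow() returns False, fail fast or serve a cached answer instead of calling the API.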

Use model fallbacks. If Opus times out, fall back to Sonnet. If Sonnet is rate-limited, fall back to Haiku. Graceful degradation is better than total failure. Your users would rather get a slightly less sophisticated response than an error page.
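The fallback chain can be sketched as a loop over model IDs. call_with_fallback and call_model are hypothetical names; call_model stands for whatever function sends the request for a given model, and retryable is the tuple of exception types that should trigger the next model.

```python
def call_with_fallback(call_model, models, retryable=(Exception,)):
    """Try each model in order; return (model, result) from the first success.

    If every model fails with a retryable error, the last error is re-raised.
    Errors that are not in `retryable` propagate immediately.
    """
    last_error = None
    for model in models:
        try:
            return model, call_model(model)
        except retryable as error:
            last_error = error
    if last_error is None:
        raise ValueError("models must not be empty")
    raise last_error


# Example wiring (assumed, not executed here):
#
#   def ask(model):
#       return client.messages.create(
#           model=model,
#           max_tokens=1024,
#           messages=[{"role": "user", "content": "Hello"}],
#       )
#
#   model, response = call_with_fallback(
#       ask,
#       ["claude-opus-4-6", "claude-sonnet-4-6", "claude-haiku-4-5"],
#       retryable=(anthropic.APITimeoutError, anthropic.RateLimitError),
#   )
```

Returning the model alongside the result lets you log which tier actually answered, which matters when you later review quality or cost.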

Validate tool inputs. Claude usually generates valid tool inputs, but edge cases exist. Validate the JSON schema of tool inputs before executing functions, especially for tools that modify data or call external services. Trust but verify.
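As an illustration, the sketch below checks required fields and basic types against a tool's input_schema before the handler runs. It is deliberately minimal and covers only a flat schema; validate_tool_input is a hypothetical helper, and a full JSON Schema validator library would be the more thorough choice. It also rejects fields the schema does not declare, which is stricter than JSON Schema's default but protects handlers called with keyword arguments.

```python
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_tool_input(schema, tool_input):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    properties = schema.get("properties", {})

    for name in schema.get("required", []):
        if name not in tool_input:
            problems.append(f"missing required field: {name}")

    for name, value in tool_input.items():
        if name not in properties:
            problems.append(f"unexpected field: {name}")
            continue
        type_name = properties[name].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # no type declared, or one this sketch does not cover
        # bool is a subclass of int in Python, so reject it explicitly
        # wherever the schema asks for something other than a boolean.
        wrong_type = not isinstance(value, expected) or (
            isinstance(value, bool) and expected is not bool
        )
        if wrong_type:
            problems.append(f"field {name} should be of type {type_name}")

    return problems
```

In the tool loop you would call this with the tool's input_schema and block.input, and return the problems to Claude as the tool_result content instead of executing the handler when the list is not empty.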

From API Calls to Agents

The Claude API is the primitive. The tool use loop is the mechanism. Agents are what you build with both. Every concept in this guide — messages, streaming, tools, vision, system prompts — composes into agent architectures that automate real work.

The progression is clear: start with simple API calls to understand the interface. Add tool use to give Claude capabilities. Wrap tool use in a loop to create agents. Add persistent memory to create agents that learn. Connect agents to external systems via MCP to create agents that act in the real world.

Every step builds on the previous one. The API is where it all starts.

For the official Agent SDK that handles the boilerplate, see our Agent SDK tutorial. For choosing between Claude Code and API-based development, read our Claude Code guide. And for comparing Claude's API against the alternatives, check our full model comparison.

If you are building professionally with the Claude API, consider getting certified. Our CCA exam prep guide walks you through the exam domains, study plan, and practice questions.

And for giving Claude access to your own data through retrieval-augmented generation, read our RAG tutorial.

For on-device AI that complements the Claude API, see our Apple Foundation Models guide — Apple's Neural Engine handles fast structured output while Claude handles deep reasoning.

