What is Claude's context window size?

All current production Claude models have a 200,000 token context window. This includes claude-sonnet-4-6, claude-opus-4-6, and claude-haiku-4-5. The context window is the total amount of text Claude can process in a single API call — including your system prompt, conversation history, documents, and user messages.

How many words fit in Claude's 200K context window?

200,000 tokens is roughly 150,000 words or about two full-length novels. In practical terms: 100,000 tokens fits a medium-sized codebase (50-100 files), a 200-page legal document is typically 50,000-80,000 tokens, and most academic papers are 5,000-10,000 tokens each.

How do I count tokens before calling the Claude API?

Use client.messages.count_tokens() in the Anthropic Python SDK. Pass the same model, system, and messages you plan to use, and it returns the input token count before making the actual API call. This lets you validate content size, implement token budgets, and avoid 400 errors from exceeding the context window.

What happens if I send too much text to Claude?

The API returns a 400 BadRequestError with a message about the context window being exceeded. Nothing is silently truncated — Claude either processes the full input or returns an error. Handle the error explicitly and retry with chunked, summarized, or trimmed input.

How do I handle Claude running out of context in a long conversation?

Use one of three strategies: (1) Sliding window — keep only the most recent N messages. (2) Token-based trimming — count tokens and remove oldest message pairs until within budget. (3) Summarization — have Claude summarize older parts of the conversation, then replace those messages with the summary. The right strategy depends on whether early context is still relevant.

How much does it cost to use Claude's full 200K context window?

A full 200,000 token input call costs significantly more than a small call — input is priced per million tokens, so a 200K token request costs 200x more than a 1K request. For repeated calls with the same large context, prompt caching reduces costs by up to 90% on the static portions. Always use the smallest context that contains the information Claude actually needs.

What is the difference between context window and max output tokens?

The context window (200K tokens) is the maximum total size of everything Claude processes — input and output combined. Max output tokens is a separate parameter you set that limits how long Claude's response can be. For claude-sonnet-4-6, max output is 16,000 tokens; for claude-opus-4-6, it is 32,000 tokens. Set max_tokens in your API call to control output length.

Can Claude process an entire book in one API call?

Yes, most books fit within Claude's 200K token context window. A typical novel (80,000-100,000 words) uses roughly 100,000-130,000 tokens, leaving room for a system prompt and output. Very long books or academic volumes may need to be split into two calls. Use client.messages.count_tokens() to verify before sending.

How do I process a document larger than Claude's context window?

Split the document into overlapping chunks using a chunking function, process each chunk separately, then synthesize the results in a final call. Use an overlap of 1,000-2,000 tokens between chunks to preserve context at boundaries. For documents that need semantic search rather than full processing, build a RAG pipeline that retrieves only the relevant sections.

Claude Context Window: 200K Tokens Explained

Does Claude remember previous conversations?

No. Claude has no memory between separate API calls or conversations. The 200K context window is per-call working space only — when the call ends, Claude forgets everything. For persistent memory across sessions, you need an external solution: a database with retrieval, a vector store for semantic search, or a structured state object passed in each system prompt.


Claude's context window is large: 200,000 tokens across all current production models. But knowing the number is only the beginning. Understanding what fits inside that window, how tokens are counted, and what happens when you approach the limit is what separates developers who get consistent results from developers who get mysterious failures.

What Is a Context Window?

The context window is the maximum amount of text Claude can see and reason about in a single API call. Everything you send (system prompt, conversation history, documents, user message) must fit within this limit, and the tokens Claude generates in response count toward it as well.

Think of it as Claude's working memory for one interaction. Unlike a database or file system, Claude cannot retrieve information from previous conversations. Each API call is completely fresh. If you need Claude to know something, it must be in the current context window.

Claude Context Window Sizes (2026)

Model	Context Window	Max Output
claude-opus-4-6	200,000 tokens	32,000 tokens
claude-sonnet-4-6	200,000 tokens	16,000 tokens
claude-haiku-4-5	200,000 tokens	8,000 tokens

Input and output share this window. Input tokens are what you send to Claude; output tokens are what Claude generates back, capped by the Max Output column above. You pay for both, and the API response reports them separately in its usage field (input_tokens and output_tokens).
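The relationship between input size, max_tokens, and the window can be sketched as a simple budget check. This is a minimal sketch, assuming that input tokens plus the requested max_tokens must together stay within the 200,000-token window from the table above. The function names are illustrative and not part of the SDK.

```python
CONTEXT_WINDOW = 200_000  # tokens, taken from the table above


def fits_in_window(input_tokens: int, max_tokens: int,
                   context_window: int = CONTEXT_WINDOW) -> bool:
    """Return True if the input plus the requested output budget fits."""
    return input_tokens + max_tokens <= context_window


def max_input_budget(max_tokens: int,
                     context_window: int = CONTEXT_WINDOW) -> int:
    """Largest input size that still leaves room for max_tokens of output."""
    return max(context_window - max_tokens, 0)


# Example: reserving 16,000 output tokens leaves 184,000 tokens for input.
budget = max_input_budget(16_000)
```

A check like this pairs naturally with a real token count: compute the input size first, then confirm the output reservation still fits.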

What Counts as a Token?

A token is roughly 3-4 characters of English text, or about 0.75 words. Practical equivalents:

1,000 tokens ≈ 750 words ≈ 1.5 pages of standard prose
10,000 tokens ≈ 7,500 words ≈ a short report or long article
100,000 tokens ≈ 75,000 words ≈ a full-length novel
200,000 tokens ≈ 150,000 words ≈ two full-length novels

Non-English text, code, and special characters tokenize differently. Python code tokenizes roughly 1 token per 3-4 characters. JSON with many brackets and quotes uses more tokens per word than plain prose. URLs and long identifiers are token-expensive relative to their semantic content.
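For quick planning before an exact count is available, the rules of thumb above can be turned into a rough estimator. This is only a sketch: the ratios are the approximations quoted in this section for English prose, the helper names are made up for illustration, and real counts for code, JSON, or non-English text will differ.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose


def estimate_tokens_from_words(word_count: int) -> int:
    """Approximate token count from a word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_tokens_from_text(text: str, chars_per_token: int = 4) -> int:
    """Approximate token count from character length.

    Pass chars_per_token=3 for a more conservative (higher) estimate.
    """
    return math.ceil(len(text) / chars_per_token)
```

Treat the result as a planning number only. For anything close to the limit, an exact count from the API is the safer basis.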

How to Count Tokens Before Sending

The Anthropic Python SDK includes a token counting method that tells you exactly how many tokens your request will use before you send it:

import anthropic

client = anthropic.Anthropic()

# Count tokens before making the actual API call
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system="You are a document analysis assistant.",
    messages=[
        {"role": "user", "content": document_text}
    ]
)

print(f"Input tokens: {token_count.input_tokens}")

# Only proceed if within budget
if token_count.input_tokens < 150_000:  # leave room for output
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=16000,
        system="You are a document analysis assistant.",
        messages=[{"role": "user", "content": document_text}]
    )

This is especially useful before processing unknown-size documents. A 500-page PDF might be 200,000 tokens on its own — knowing that before the API call prevents an error and lets you decide whether to split the document.

What Fits in 200,000 Tokens

To make the limit concrete:

A full codebase — a medium-sized project (50-100 files) often fits within 100K tokens, leaving room for instructions and output
Legal documents — a 200-page contract is typically 50,000-80,000 tokens
Research papers — 10-20 academic papers typically fit
Conversation history — a multi-hour customer support session might use 5,000-15,000 tokens
Books — most novels fit entirely in one context window

What does NOT fit: very large codebases (thousands of files), full databases, or streaming content longer than roughly 150,000 words. For these, you need chunking or retrieval strategies.

Managing the Context Window in Multi-turn Conversations

In a chatbot or assistant, conversation history grows with each turn. If you naively append every message, you will eventually hit the limit. Common strategies:

1. Sliding Window

Keep only the most recent N turns in the messages array. Simple and effective for short-horizon tasks:

MAX_TURNS = 20  # number of messages kept: the first one plus the 19 most recent

def trim_history(messages: list) -> list:
    if len(messages) > MAX_TURNS:
        # Always keep the first message (context setup) + recent messages
        return [messages[0]] + messages[-(MAX_TURNS - 1):]
    return messages

2. Token-based Trimming

More precise — trim based on token count rather than message count:

def trim_to_budget(messages: list, system: str, budget: int = 150_000) -> list:
    while len(messages) > 1:
        count = client.messages.count_tokens(
            model="claude-sonnet-4-6",
            system=system,
            messages=messages
        )
        if count.input_tokens <= budget:
            break
        messages = messages[2:]  # remove oldest user+assistant pair
    return messages
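The function above makes one counting request per loop iteration. Where that is undesirable, a local estimate can stand in for the exact count. The following is a sketch under two assumptions: every message's content is a plain string, and 4 characters per token is close enough for trimming. The helper names are hypothetical.

```python
def estimate_message_tokens(message: dict, chars_per_token: int = 4) -> int:
    """Rough per-message token estimate; the +1 rounds up."""
    return len(message["content"]) // chars_per_token + 1


def trim_to_budget_estimated(messages: list[dict], budget: int = 150_000) -> list[dict]:
    """Drop the oldest user+assistant pairs until the estimate fits the budget."""
    kept = list(messages)  # work on a copy so the caller's list is untouched
    total = sum(estimate_message_tokens(m) for m in kept)
    while len(kept) > 2 and total > budget:
        total -= estimate_message_tokens(kept[0]) + estimate_message_tokens(kept[1])
        kept = kept[2:]
    return kept
```

Because the estimate is approximate, it makes sense to leave extra headroom in the budget or to confirm the final list with one exact count.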

3. Summarization

When old context is too important to discard, have Claude summarize it first, then replace the old messages with the summary:

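def summarize_and_compress(old_messages: list) -> str:
    # Render the history as a readable transcript rather than a raw Python repr
    # (assumes each message's content is a plain string)
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # use cheaper model for summarization
        max_tokens=1000,
        system="Summarize this conversation history concisely, preserving all key decisions, facts, and context.",
        messages=[{"role": "user", "content": transcript}]
    )
    return summary_response.content[0].text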
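Once a summary exists, the old messages still have to be swapped out for it. One possible way, sketched here, is to fold the summary into the system prompt and keep only the most recent messages. The function name, the keep_recent default, and the choice to start the kept tail on a user turn are all assumptions made for illustration.

```python
def compress_history(messages: list[dict], summary: str, system: str,
                     keep_recent: int = 6) -> tuple[str, list[dict]]:
    """Return (new_system, recent_messages) with older turns replaced by a summary."""
    if len(messages) <= keep_recent:
        return system, messages  # nothing to compress yet
    recent = messages[-keep_recent:]
    # Start the kept tail on a user turn so the conversation still opens with one.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    new_system = f"{system}\n\nSummary of the earlier conversation:\n{summary}"
    return new_system, recent
```

The returned pair can then be passed as the system and messages arguments of the next call.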

Chunking Long Documents

When a document exceeds the context window (or approaches it, leaving little room for output), split it into overlapping chunks:

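def chunk_text(text: str, chunk_tokens: int = 80_000, overlap_tokens: int = 2_000) -> list[str]:
    """Split text into overlapping chunks measured in approximate tokens."""
    if overlap_tokens >= chunk_tokens:
        # Otherwise the start index would never advance
        raise ValueError("overlap_tokens must be smaller than chunk_tokens")
    # Rough estimate: 4 chars per token
    chars_per_chunk = chunk_tokens * 4
    overlap_chars = overlap_tokens * 4
    chunks = []
    start = 0
    while start < len(text):
        end = start + chars_per_chunk
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; skip a trailing chunk that is pure overlap
        start = end - overlap_chars
    return chunks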

# Process each chunk
results = []
for chunk in chunk_text(large_document):
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4000,
        system="Summarize the key information in this document section.",
        messages=[{"role": "user", "content": chunk}]
    )
    results.append(result.content[0].text)

# Final synthesis
final = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8000,
    system="Synthesize these document summaries into a coherent final summary.",
    messages=[{"role": "user", "content": "\n\n---\n\n".join(results)}]
)

Context Window vs. Memory

A common misconception: Claude does not have memory between conversations. The 200K context window is per-API-call working space, not persistent storage. When the call ends, Claude forgets everything.

For persistent memory across sessions, you need an external solution:

Database + retrieval: Store conversation history in a database, retrieve relevant chunks before each call
Vector search (RAG): Embed past conversations and retrieve semantically similar content
Structured state: Maintain a JSON state object that you pass in the system prompt each session
Specialized tools: Libraries like sovereign-brain or claude-brain manage persistent context as a layer above the API
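The structured-state option can be as small as a dictionary serialized into the system prompt. This is a minimal sketch: the state keys and the wording of the prompt header are invented for illustration, and where the state is stored between sessions is left open.

```python
import json


def build_system_prompt(base_prompt: str, state: dict) -> str:
    """Append a JSON snapshot of persistent state to the base system prompt."""
    state_json = json.dumps(state, indent=2, sort_keys=True)
    return f"{base_prompt}\n\nPersistent state from earlier sessions (JSON):\n{state_json}"


# Hypothetical state carried over from a previous session
state = {"user_name": "Ada", "open_tasks": ["renew contract"]}
system_prompt = build_system_prompt("You are a helpful assistant.", state)
```

After each session, the application would update the state object and save it, so the next session starts from the new snapshot.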

Cost Implications of Large Context Windows

Every token in the context window costs money — input tokens on the way in, output tokens on the way back. A 200K token request is 200x more expensive on input than a 1K token request. For workflows that repeatedly process the same large context, prompt caching reduces the cost by up to 90% on the static portion.
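The arithmetic behind these figures is easy to check. In the sketch below, the price per million tokens is a placeholder and not a real rate, and the 90% discount is the "up to" figure quoted above. Any extra cost of writing the cache in the first place is ignored.

```python
PRICE_PER_MILLION = 3.00  # placeholder input price in dollars, not a real rate


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION) -> float:
    """Cost of sending the given number of input tokens."""
    return tokens / 1_000_000 * price_per_million


def cached_input_cost(static_tokens: int, dynamic_tokens: int,
                      price_per_million: float = PRICE_PER_MILLION,
                      cache_discount: float = 0.90) -> float:
    """Input cost when the static portion is served from the cache."""
    static_cost = input_cost(static_tokens, price_per_million) * (1 - cache_discount)
    return static_cost + input_cost(dynamic_tokens, price_per_million)


# Ratio between a 200K-token and a 1K-token request
ratio = input_cost(200_000) / input_cost(1_000)
```

With 190,000 static tokens and 10,000 changing tokens, the cached call in this sketch costs well under a fifth of the uncached one, which is why caching pays off most when the static share is large.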

Rule of thumb: use the smallest context that contains the information Claude actually needs. Padding the context with irrelevant content wastes money and, at very high token counts, can dilute Claude's attention on the relevant parts.

What Happens When You Exceed the Limit

The API returns an error — specifically a 400 with a message about the context window being exceeded. Nothing is silently truncated. This means you need to handle the error explicitly:

from anthropic import BadRequestError

try:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=16000,
        messages=[{"role": "user", "content": very_long_text}]
    )
except BadRequestError as e:
    if "prompt is too long" in str(e).lower():
        # Retry with chunked or summarized input
        # (handle_context_overflow is a placeholder for your own fallback logic)
        handle_context_overflow(very_long_text)
    else:
        raise
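One possible shape for the fallback is a retry loop that shortens the input until the call goes through. The sketch below is deliberately generic: send and is_overflow are callables you supply, the function name is hypothetical, and halving the text is a lossy trimming strategy that keeps only the beginning. Chunking or summarizing is usually preferable when the whole document matters.

```python
from typing import Callable


def call_with_shrinking(send: Callable[[str], str], text: str,
                        is_overflow: Callable[[Exception], bool],
                        min_chars: int = 1_000) -> str:
    """Retry send() on progressively shorter input until it fits."""
    while True:
        try:
            return send(text)
        except Exception as exc:
            # Re-raise anything that is not an overflow, or give up at min_chars
            if not is_overflow(exc) or len(text) <= min_chars:
                raise
            text = text[: len(text) // 2]  # lossy: keeps only the first half
```

With the SDK, send would wrap client.messages.create, and is_overflow would apply the same BadRequestError and message check used in the example above.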

The Bottom Line

200,000 tokens is enough for most real-world documents, codebases, and conversations. The limit is generous — the challenge is using it efficiently. Count tokens before sending large content, implement trimming for long conversations, and use chunking for documents that exceed the window. For repeated large-context calls, prompt caching is the cost optimization to implement first.

For more on optimizing your API usage, see our prompt caching guide and system prompt guide. Building agents that need persistent memory across sessions? Our persistent memory patterns guide covers the architecture options in detail.