Claude's context window is large: 200,000 tokens as standard across the models covered here. But knowing the number is only the beginning. Understanding what fits inside that window, how tokens are counted, and what happens when you approach the limit is what separates developers who get consistent results from developers who get mysterious failures.
What Is a Context Window?
The context window is the maximum amount of text Claude can see and reason about in a single API call. Everything you send (system prompt, conversation history, documents, user message) must fit within this limit. The response is generated into the same window, so the output tokens you request count toward the limit as well.
Think of it as Claude's working memory for one interaction. Unlike a database or file system, Claude cannot retrieve information from previous conversations. Each API call is completely fresh. If you need Claude to know something, it must be in the current context window.
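Because each call starts fresh, the client is what carries a conversation forward. A minimal sketch of that idea, assuming a hypothetical `Conversation` helper (not part of the Anthropic SDK) that accumulates turns in the `messages` format and rebuilds the full request every time:

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical helper: holds the history that must be resent on every call."""

    system: str
    messages: list[dict] = field(default_factory=list)

    def add_user(self, text: str) -> None:
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text: str) -> None:
        self.messages.append({"role": "assistant", "content": text})

    def request_kwargs(self, model: str, max_tokens: int) -> dict:
        """Build the keyword arguments for one API call, full history included."""
        return {
            "model": model,
            "max_tokens": max_tokens,
            "system": self.system,
            "messages": list(self.messages),  # copy so later turns do not mutate it
        }


convo = Conversation(system="You are a helpful assistant.")
convo.add_user("My order number is 1042.")
convo.add_assistant("Thanks, I have noted order 1042.")
convo.add_user("What was my order number?")

# The last question is only answerable because the earlier turns are resent.
kwargs = convo.request_kwargs(model="claude-sonnet-4-6", max_tokens=1000)
print(len(kwargs["messages"]))
```

The resulting dictionary could be passed along as `client.messages.create(**kwargs)`; drop the earlier turns from it and the model has no way to recover them.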
Claude Context Window Sizes (2026)
| Model | Context Window | Max Output |
|---|---|---|
| claude-opus-4-6 | 200,000 tokens | 32,000 tokens |
| claude-sonnet-4-6 | 200,000 tokens | 16,000 tokens |
| claude-haiku-4-5 | 200,000 tokens | 8,000 tokens |
Input and output draw on the same window but are metered separately. Input tokens are everything you send to Claude; output tokens are what Claude generates back, capped by the Max Output column. You pay for both, and the API response reports them separately in its usage field as input_tokens and output_tokens.
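The budgeting reduces to simple arithmetic. A sketch, assuming the input tokens and the requested `max_tokens` must together fit inside the window; the helper names are illustrative and not SDK functions:

```python
CONTEXT_WINDOW = 200_000  # window size from the table above


def output_room(input_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the response once the input is accounted for."""
    return max(window - input_tokens, 0)


def fits(input_tokens: int, max_tokens: int, window: int = CONTEXT_WINDOW) -> bool:
    """True when the input plus the requested output stays inside the window."""
    return input_tokens + max_tokens <= window


print(fits(150_000, 16_000))   # True: 166,000 tokens in total
print(fits(190_000, 16_000))   # False: 206,000 tokens in total
print(output_room(190_000))    # 10000
```

When `fits` returns False, the options are to shrink the input or to request a smaller `max_tokens`.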
What Counts as a Token?
A token is roughly 3-4 characters of English text, or about 0.75 words. Practical equivalents:
- 1,000 tokens ≈ 750 words ≈ 1.5 pages of standard prose
- 10,000 tokens ≈ 7,500 words ≈ a short report or long article
- 100,000 tokens ≈ 75,000 words ≈ a full-length novel
- 200,000 tokens ≈ 150,000 words ≈ two full-length novels
Non-English text, code, and special characters tokenize differently. Python code tokenizes roughly 1 token per 3-4 characters. JSON with many brackets and quotes uses more tokens per word than plain prose. URLs and long identifiers are token-expensive relative to their semantic content.
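For a quick estimate without calling the API, the rules of thumb above can be turned into code. This is a rough sketch only: both constants come from the heuristics in this section, the function names are made up for illustration, and real token counts will differ, especially for code, JSON, and non-English text.

```python
import math

CHARS_PER_TOKEN = 4.0   # upper end of the 3-4 characters-per-token rule of thumb
WORDS_PER_TOKEN = 0.75  # about 0.75 English words per token


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens from character count (tends to run low for dense text)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate tokens from whitespace-separated word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


sample = "word " * 750  # 750 words, 3,750 characters
print(estimate_tokens_from_words(sample))  # 1000
print(estimate_tokens_from_chars(sample))  # 938
```

The two heuristics disagree slightly, which is expected. When budgeting, taking the larger of the two is the safer choice.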
How to Count Tokens Before Sending
The Anthropic Python SDK includes a token counting method that reports how many input tokens a request will use before you send it. Treat the result as a close estimate rather than a guarantee:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# document_text is assumed to hold the document text, loaded elsewhere
# Count tokens before making the actual API call
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system="You are a document analysis assistant.",
    messages=[
        {"role": "user", "content": document_text}
    ]
)
print(f"Input tokens: {token_count.input_tokens}")

# Only proceed if within budget
if token_count.input_tokens < 150_000:  # leave room for output
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=16000,
        system="You are a document analysis assistant.",
        messages=[{"role": "user", "content": document_text}]
    )

This is especially useful before processing documents of unknown size. A 500-page PDF might be 200,000 tokens on its own; knowing that before the API call prevents an error and lets you decide whether to split the document.
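Once the count is known, deciding whether to split is arithmetic. A sketch with a hypothetical `chunks_needed` helper; the 80,000-token chunk size and 2,000-token overlap in the example are illustrative values, not API recommendations:

```python
import math


def chunks_needed(total_tokens: int, chunk_tokens: int, overlap_tokens: int = 0) -> int:
    """Number of overlapping chunks required to cover total_tokens."""
    if chunk_tokens <= overlap_tokens:
        raise ValueError("chunk_tokens must be larger than overlap_tokens")
    if total_tokens <= chunk_tokens:
        return 1
    # Every chunk after the first advances by (chunk - overlap) new tokens.
    stride = chunk_tokens - overlap_tokens
    return 1 + math.ceil((total_tokens - chunk_tokens) / stride)


print(chunks_needed(50_000, 80_000))          # 1: fits in a single call
print(chunks_needed(200_000, 80_000, 2_000))  # 3
```

A result of 1 means the document can go in a single request; anything higher is a signal to plan for splitting and a final synthesis pass.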
What Fits in 200,000 Tokens
To make the limit concrete:
- A full codebase: a medium-sized project (50-100 files) often fits within 100K tokens, leaving room for instructions and output
- Legal documents: a 200-page contract is typically 50,000-80,000 tokens
- Research papers: 10-20 academic papers typically fit
- Conversation history: a multi-hour customer support session might use 5,000-15,000 tokens
- Books: most novels fit entirely in one context window
What does NOT fit: very large codebases (thousands of files), full databases, or streaming content longer than roughly 150,000 words. For these, you need chunking or retrieval strategies.
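When a codebase is close to the limit, one option is to select files against a token budget before sending anything. The sketch below is one possible approach, not a prescribed method: `pack_files` is a hypothetical helper, it uses the rough 4-characters-per-token estimate, and it adds the smallest files first so that as many files as possible are included.

```python
def pack_files(files: dict[str, str], budget_tokens: int, chars_per_token: int = 4):
    """Greedy selection: smallest files first, skip any file that would exceed the budget."""
    included, skipped = [], []
    used = 0
    for path, text in sorted(files.items(), key=lambda item: len(item[1])):
        cost = -(-len(text) // chars_per_token)  # ceiling division
        if used + cost <= budget_tokens:
            included.append(path)
            used += cost
        else:
            skipped.append(path)
    return included, skipped, used


example_files = {
    "a.py": "x" * 400,    # about 100 tokens
    "b.py": "x" * 4000,   # about 1,000 tokens
    "c.py": "x" * 40,     # about 10 tokens
}
included, skipped, used = pack_files(example_files, budget_tokens=200)
print(included, skipped, used)
```

Smallest-first maximizes file count, which is not always what you want. Ranking by relevance to the task and then packing in that order is an equally reasonable design.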
Managing the Context Window in Multi-turn Conversations
In a chatbot or assistant, conversation history grows with each turn. If you naively append every message, you will eventually hit the limit. Common strategies:
1. Sliding Window
Keep only the most recent N turns in the messages array. Simple and effective for short-horizon tasks:
MAX_TURNS = 20  # keep last 20 messages

def trim_history(messages: list) -> list:
    if len(messages) > MAX_TURNS:
        # Always keep the first message (context setup) + recent messages
        return [messages[0]] + messages[-(MAX_TURNS - 1):]
    return messages

2. Token-based Trimming
More precise — trim based on token count rather than message count:
def trim_to_budget(messages: list, system: str, budget: int = 150_000) -> list:
    # Uses the client created earlier; each loop iteration makes one count_tokens call.
    # Stop at the last user+assistant pair so the list is never emptied.
    while len(messages) > 2:
        count = client.messages.count_tokens(
            model="claude-sonnet-4-6",
            system=system,
            messages=messages
        )
        if count.input_tokens <= budget:
            break
        messages = messages[2:]  # remove oldest user+assistant pair
    return messages

3. Summarization
When old context is too important to discard, have Claude summarize it first, then replace the old messages with the summary:
def summarize_and_compress(old_messages: list) -> str:
    # Render the history as a readable transcript (assumes each content is a plain string)
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    summary_response = client.messages.create(
        model="claude-haiku-4-5",  # use cheaper model for summarization
        max_tokens=1000,
        system="Summarize this conversation history concisely, preserving all key decisions, facts, and context.",
        messages=[{"role": "user", "content": transcript}]
    )
    return summary_response.content[0].text

Chunking Long Documents
When a document exceeds the context window (or approaches it, leaving little room for output), split it into overlapping chunks:
def chunk_text(text: str, chunk_tokens: int = 80_000, overlap_tokens: int = 2_000) -> list[str]:
    """Split text into overlapping chunks measured in approximate tokens."""
    if overlap_tokens >= chunk_tokens:
        raise ValueError("overlap_tokens must be smaller than chunk_tokens")
    # Rough estimate: 4 chars per token
    chars_per_chunk = chunk_tokens * 4
    overlap_chars = overlap_tokens * 4
    chunks = []
    start = 0
    while start < len(text):
        end = start + chars_per_chunk
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid emitting an overlap-only tail
        start = end - overlap_chars
    return chunks

# Process each chunk (large_document is assumed to hold the text, loaded elsewhere)
results = []
for chunk in chunk_text(large_document):
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4000,
        system="Summarize the key information in this document section.",
        messages=[{"role": "user", "content": chunk}]
    )
    results.append(result.content[0].text)

# Final synthesis
final = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8000,
    system="Synthesize these document summaries into a coherent final summary.",
    messages=[{"role": "user", "content": "\n\n---\n\n".join(results)}]
)

Context Window vs. Memory
A common misconception is that Claude remembers earlier conversations. It does not. The 200K context window is per-API-call working space, not persistent storage. When the call ends, nothing carries over to the next one.
For persistent memory across sessions, you need an external solution:
- Database + retrieval: Store conversation history in a database, retrieve relevant chunks before each call
- Vector search (RAG): Embed past conversations and retrieve semantically similar content
- Structured state: Maintain a JSON state object that you pass in the system prompt each session
- Specialized tools: Libraries like sovereign-brain or claude-brain manage persistent context as a layer above the API
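The structured-state option is the simplest of these to sketch. The example below is a minimal illustration under stated assumptions: the state schema (`facts`, `open_tasks`), the file location, and all three helper names are invented for this example, and a real system would likely need locking, validation, and size limits.

```python
import json
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read saved state, or start with an empty structure on the first session."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"facts": [], "open_tasks": []}


def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def build_system_prompt(base: str, state: dict) -> str:
    """Append the saved state to the system prompt so each new call can see it."""
    return f"{base}\n\nKnown state from earlier sessions:\n{json.dumps(state, indent=2)}"


with tempfile.TemporaryDirectory() as tmp:
    state_path = Path(tmp) / "state.json"

    state = load_state(state_path)  # first session: empty state
    state["facts"].append("User prefers metric units")
    save_state(state_path, state)

    restored = load_state(state_path)  # later session: state comes back from disk
    system_prompt = build_system_prompt("You are a helpful assistant.", restored)

print(system_prompt)
```

The state object counts against the context window like any other input, so it pays to keep it compact and to prune stale entries.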
Cost Implications of Large Context Windows
Every token in the context window costs money: input tokens on the way in, output tokens on the way back. At the same per-token price, a 200K token request costs 200x more on input than a 1K token request. For workflows that repeatedly process the same large context, prompt caching reduces the cost of the static portion by up to 90%.
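To make the arithmetic concrete, here is a rough sketch. The price is a hypothetical placeholder, not a published rate. The 90% discount is the best case mentioned above, and the calculation ignores any extra charge for writing to the cache. The `cache_control` marker reflects the prompt caching request format as commonly documented; check it against the current documentation before relying on it.

```python
PRICE_PER_MTOK = 3.00  # hypothetical input price, dollars per million tokens


def input_cost(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Dollar cost of sending `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_mtok


def cached_input_cost(static_tokens: int, dynamic_tokens: int,
                      price_per_mtok: float = PRICE_PER_MTOK,
                      read_discount: float = 0.90) -> float:
    """Cost of one call when the static portion is read from cache (best case)."""
    static_cost = input_cost(static_tokens, price_per_mtok) * (1 - read_discount)
    return static_cost + input_cost(dynamic_tokens, price_per_mtok)


def cached_system_blocks(static_text: str) -> list[dict]:
    """System prompt expressed as a content block marked for caching."""
    return [{
        "type": "text",
        "text": static_text,
        "cache_control": {"type": "ephemeral"},
    }]


ratio = input_cost(200_000) / input_cost(1_000)
uncached = input_cost(200_000)
cached = cached_input_cost(static_tokens=190_000, dynamic_tokens=10_000)
print(round(ratio), round(uncached, 3), round(cached, 3))
```

The list returned by `cached_system_blocks` would be passed as the `system` argument in place of a plain string, with the large static document placed inside it.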
Rule of thumb: use the smallest context that contains the information Claude actually needs. Padding the context with irrelevant content wastes money and, at very high token counts, can dilute Claude's attention on the relevant parts.
What Happens When You Exceed the Limit
The API returns an error, specifically a 400 with a message about the context window being exceeded. Nothing is silently truncated. This means you need to handle the error explicitly:
from anthropic import BadRequestError

# handle_context_overflow is a placeholder for your own chunking or summarizing fallback
try:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=16000,
        messages=[{"role": "user", "content": very_long_text}]
    )
except BadRequestError as e:
    if "prompt is too long" in str(e).lower():
        # Retry with chunked or summarized input
        handle_context_overflow(very_long_text)
    else:
        raise

The Bottom Line
200,000 tokens is enough for most real-world documents, codebases, and conversations. The limit is generous; the challenge is using it efficiently. Count tokens before sending large content, implement trimming for long conversations, and use chunking for documents that exceed the window. For repeated large-context calls, prompt caching is the cost optimization to implement first.
For more on optimizing your API usage, see our prompt caching guide and system prompt guide. Building agents that need persistent memory across sessions? Our persistent memory patterns guide covers the architecture options in detail.