Every Claude API key operates under rate limits — caps on how many requests and tokens you can send per minute, tied to your account's usage tier. Hit one and you get a 429 response instead of a completion. For a single prototype script, that rarely matters. For a production agent, a batch job, or anything serving concurrent users, understanding these limits — and building around them — is the difference between a system that degrades gracefully and one that falls over during a traffic spike.
This guide covers how Claude's rate limits work, how to read the response headers that tell you exactly where you stand, and the retry and backoff patterns that keep your application running smoothly when you brush up against a ceiling.
How Claude's Rate Limits Work
Anthropic enforces limits along three dimensions simultaneously, and you can be capped by whichever one you hit first:
- RPM — requests per minute
- ITPM — input tokens per minute
- OTPM — output tokens per minute
Your actual numbers depend on your usage tier, which is determined by your account's spending history — new accounts start at the lowest tier with conservative limits, and tiers increase automatically as your account accrues usage and payment history. There is no manual upgrade request needed in most cases; check your exact current limits in the Anthropic Console under Settings → Limits, since they vary by tier and model and change over time.
Limits apply per organization, not per API key — if you have multiple keys under one account, they all draw from the same pool.
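To see how the three limits interact, it can help to work out which one binds first for a given workload. The following is a rough sketch only: binding_limit is a hypothetical helper, and the limits and per-request token averages are made-up numbers rather than any tier's real figures.

```python
def binding_limit(rpm, itpm, otpm, avg_input_tokens, avg_output_tokens):
    """Return (max sustainable requests per minute, name of the binding limit).

    All arguments are assumptions supplied by the caller; real limits come
    from the Anthropic Console.
    """
    candidates = {
        "RPM": rpm,
        "ITPM": itpm / avg_input_tokens,   # requests/min the input budget allows
        "OTPM": otpm / avg_output_tokens,  # requests/min the output budget allows
    }
    name = min(candidates, key=candidates.get)
    return candidates[name], name


# Hypothetical workload: 50 RPM, 40,000 ITPM, 8,000 OTPM,
# with requests averaging 2,000 input and 200 output tokens.
rate, limit = binding_limit(50, 40_000, 8_000, 2_000, 200)
print(f"{limit} binds at about {rate:.0f} requests/minute")
```

With these illustrative numbers the input-token budget runs out long before the request count does, which is why watching RPM alone can be misleading.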
Reading Rate Limit Headers
Every API response — successful or not — includes headers telling you exactly how much headroom you have left. Check these instead of guessing:
| Header | Meaning |
|---|---|
| anthropic-ratelimit-requests-limit | Max requests allowed in the current window |
| anthropic-ratelimit-requests-remaining | Requests left before you hit the cap |
| anthropic-ratelimit-requests-reset | Timestamp when the request count resets |
| anthropic-ratelimit-input-tokens-remaining | Input tokens left in the current window |
| anthropic-ratelimit-output-tokens-remaining | Output tokens left in the current window |
| retry-after | Seconds to wait before retrying (present on 429 responses) |
Reading these headers proactively — and slowing down before you hit zero — is far better UX than waiting for a 429 and reacting after the fact.
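One way to act on these headers before a 429 occurs is a small check that runs after each response. The sketch below is a minimal example, assuming headers is any mapping of header names to strings and that the reset value is an RFC 3339 timestamp. The function seconds_to_pause and its min_remaining threshold are hypothetical names for illustration, not part of the SDK.

```python
from datetime import datetime, timezone


def seconds_to_pause(headers, min_remaining=5, now=None):
    """Return how long to pause before the next request, or 0.0 to proceed.

    `headers` is any mapping of response header names to string values.
    `min_remaining` is a hypothetical safety threshold, not an API setting.
    """
    remaining = headers.get("anthropic-ratelimit-requests-remaining")
    reset = headers.get("anthropic-ratelimit-requests-reset")
    if remaining is None or reset is None:
        return 0.0  # headers missing: nothing to act on
    if int(remaining) > min_remaining:
        return 0.0  # enough headroom left
    # Assumes an RFC 3339 timestamp such as 2025-01-01T00:00:30Z
    reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())
```

In a request loop, the return value could be passed to time.sleep() before the next call. The same pattern applies to the input-token and output-token headers.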
import anthropic

client = anthropic.Anthropic()

# with_raw_response exposes the HTTP headers alongside the parsed message
response = client.messages.with_raw_response.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, Claude"}],
)

print("Requests remaining:", response.headers.get("anthropic-ratelimit-requests-remaining"))
print("Input tokens remaining:", response.headers.get("anthropic-ratelimit-input-tokens-remaining"))
print("Output tokens remaining:", response.headers.get("anthropic-ratelimit-output-tokens-remaining"))

# parse() returns the usual Message object
message = response.parse()
print(message.content[0].text)

What a 429 Looks Like
When you exceed a limit, the API returns HTTP 429 with an error body identifying which limit you crossed:
{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of request tokens has exceeded your per-minute rate limit"
  }
}

The response also includes a retry-after header telling you exactly how many seconds to wait. Respect it: retrying immediately just produces another 429 and wastes a request slot you don't have.
A separate error type, overloaded_error (HTTP 529), means Anthropic's infrastructure is under heavy load rather than your account hitting its own limit. Treat it the same way: back off and retry, since it is not something a rate limit increase would fix.
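The distinction between the two error types can be sketched as a small classifier over the status code and JSON body. This is an illustrative helper only: classify_error and the category labels it returns are hypothetical, and it assumes the error body has the shape shown above.

```python
import json


def classify_error(status_code, body_text):
    """Map an HTTP status and JSON error body to a coarse retry decision.

    The labels returned here are hypothetical names for this sketch.
    """
    try:
        error_type = json.loads(body_text).get("error", {}).get("type")
    except (json.JSONDecodeError, AttributeError, TypeError):
        error_type = None  # body was not the expected JSON object

    if status_code == 429 or error_type == "rate_limit_error":
        return "rate_limited"  # your own limit: honor retry-after
    if status_code == 529 or error_type == "overloaded_error":
        return "overloaded"    # service-side load: back off and retry
    return "other"             # not a capacity problem: do not blindly retry
```

Keeping the two cases separate in logs makes it easier to tell whether a spike in retries calls for a limit increase or is simply service-side load.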
Exponential Backoff in Python
The official SDK already retries transient errors (including 429 and 529) automatically up to a default of 2 retries with backoff. For custom retry logic — or if you want to log and monitor retries — implement your own backoff loop:
import random
import time

import anthropic

# Disable the SDK's built-in retries so they don't stack with the loop below
client = anthropic.Anthropic(max_retries=0)


def call_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Prefer the server's retry-after; otherwise exponential backoff with jitter
            retry_after = e.response.headers.get("retry-after")
            wait = float(retry_after) if retry_after else (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except anthropic.APIStatusError as e:
            # 529 means the service is overloaded: back off and retry; re-raise anything else
            if e.status_code == 529 and attempt < max_retries - 1:
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
            else:
                raise


response = call_with_backoff([{"role": "user", "content": "Summarize this quarter's revenue."}])
print(response.content[0].text)

Preferring the server's retry-after value over a fixed exponential curve matters: Anthropic knows exactly when your window resets, and using that number avoids both under-waiting (another 429) and over-waiting (idle time you didn't need).
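The wait calculation can also be pulled out into a helper that tolerates a missing or malformed header. The sketch below is one possible version, not the SDK's logic: compute_wait, base, and cap are hypothetical names. The guide describes retry-after as a number of seconds. HTTP in general also permits a date in that header, so handling the date form here is purely defensive.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def compute_wait(retry_after, attempt, base=1.0, cap=60.0):
    """Seconds to sleep before retry number `attempt` (0-based).

    `base` and `cap` are hypothetical tuning knobs for this sketch.
    """
    if retry_after:
        try:
            return max(0.0, float(retry_after))  # numeric seconds form
        except ValueError:
            pass
        try:
            # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
            when = parsedate_to_datetime(retry_after)
            return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
        except (TypeError, ValueError):
            pass  # unparseable header: fall through to backoff
    # No usable header: exponential backoff with up to 1 s of jitter, capped
    return min(cap, base * (2 ** attempt) + random.uniform(0, 1))
```

The server's value is returned uncapped on purpose, since waiting less than the server asked for would only produce another 429. The cap applies only to the fallback curve.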
Configuring SDK Retries Directly
Both official SDKs let you configure retry behavior at the client level instead of writing your own loop:
import anthropic

# Python: set max_retries on the client
client = anthropic.Anthropic(max_retries=5)

# Or override per-request: with_options() is called on the client, before the request
response = client.with_options(max_retries=2).messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)

// TypeScript: same pattern
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ maxRetries: 5 });

// Per-request options go in the second argument
const message = await client.messages.create(
  {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Hello" }],
  },
  { maxRetries: 2 },
);

The SDK's built-in retry already applies exponential backoff with jitter to 429 and 5xx responses, so for most applications, bumping max_retries is enough. You don't need custom retry code unless you want fine-grained logging or per-request retry-after handling.
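Raising max_retries also raises the worst-case latency of a single call, which is worth estimating before picking a number. The sketch below shows the shape of a capped exponential schedule. The base, factor, and cap values are hypothetical, chosen for illustration, and are not claimed to match the SDK's internal constants. A server-supplied retry-after can also change the actual waits.

```python
def backoff_schedule(max_retries, base=0.5, factor=2.0, cap=8.0):
    """Upper-bound delay before each retry, ignoring jitter.

    base, factor, and cap are hypothetical values for illustration only.
    """
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


for retries in (2, 5):
    delays = backoff_schedule(retries)
    print(f"max_retries={retries}: delays={delays}, worst case about {sum(delays):.1f}s")
```

For interactive requests, a low retry count keeps the worst case short. For background jobs, a higher count is usually an acceptable trade for fewer hard failures.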
Client-Side Rate Limiting (Staying Under the Ceiling)
Reactive retries handle occasional spikes, but for sustained high-throughput workloads (bulk classification, large document pipelines) it is more efficient to throttle your own request rate so you rarely hit a 429 at all. A simple sliding-window limiter around your request loop does this by remembering the timestamps of the last minute's requests and pausing once the window is full:
import time
from collections import deque

import anthropic

client = anthropic.Anthropic()


class RateLimiter:
    def __init__(self, max_requests_per_minute):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()  # timestamps of requests sent in the last 60 s

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps that have aged out of the 60-second window
        while self.request_times and self.request_times[0] < now - 60:
            self.request_times.popleft()
        # Window full: sleep until the oldest request ages out
        if len(self.request_times) >= self.max_requests:
            sleep_time = 60 - (now - self.request_times[0])
            if sleep_time > 0:
                time.sleep(sleep_time)
        self.request_times.append(time.time())


limiter = RateLimiter(max_requests_per_minute=50)
prompts = ["First prompt", "Second prompt"]  # placeholder inputs

for prompt in prompts:
    limiter.wait_if_needed()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )

Set max_requests_per_minute comfortably below your actual tier limit (leave headroom for other traffic on the same account) rather than trying to max it out exactly.
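The limiter above counts requests only, and token limits can bind before request limits do. A token bucket is one way to budget tokens as well. The following is a minimal sketch under stated assumptions: TokenBucket is a hypothetical class, each bucket unit stands for one estimated input token, and the per-request estimates come from the caller (for example a rough character count), not from the API.

```python
import time


class TokenBucket:
    """Token bucket that refills continuously at `rate_per_minute`.

    One bucket unit stands for one estimated input token, so the bucket
    models an ITPM budget. Capacity and estimates are assumptions.
    """

    def __init__(self, rate_per_minute, capacity=None, clock=time.monotonic):
        self.rate = rate_per_minute / 60.0           # units added per second
        self.capacity = capacity or rate_per_minute  # burst ceiling
        self.available = float(self.capacity)
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.available = min(self.capacity, self.available + (now - self.last) * self.rate)
        self.last = now

    def delay_for(self, cost):
        """Reserve `cost` units and return how many seconds to sleep first."""
        if cost > self.capacity:
            raise ValueError("request is larger than the bucket capacity")
        self._refill()
        self.available -= cost  # may go negative: later callers wait longer
        return max(0.0, -self.available / self.rate)


bucket = TokenBucket(rate_per_minute=30_000)  # hypothetical budget below the real ITPM
for estimated_tokens in (12_000, 12_000, 12_000):
    delay = bucket.delay_for(estimated_tokens)
    # In a real loop: time.sleep(delay), then send the request
    print(f"wait {delay:.1f}s before sending ~{estimated_tokens} tokens")
```

Unlike the sliding window, a bucket allows a short burst up to its capacity and then settles to the steady refill rate, which suits workloads with uneven request sizes.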
When to Use the Batch API Instead
If you are processing a large volume of independent requests and hitting rate limits from sequential synchronous calls, the better fix is often not a smarter retry loop but a switch to the Message Batches API. Batches are processed asynchronously under separate, much higher throughput limits than synchronous requests, cost 50% less, and largely remove the need for client-side throttling, since Anthropic manages the pacing on its end. The trade-off is latency: results arrive when the batch finishes rather than immediately, so batches suit work that does not need an answer in real time.
| Situation | Best Approach |
|---|---|
| Interactive chat, single user | Synchronous requests, SDK default retries are enough |
| Concurrent multi-user production app | Client-side rate limiter + exponential backoff on 429 |
| Large one-off bulk job (thousands of items) | Batch API — sidesteps synchronous rate limits entirely |
| Nightly scheduled processing | Batch API or a throttled loop run overnight |
| Agent loop with tool calls | Backoff with retry-after respected; consider prompt caching to cut token volume per call |
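For the bulk-job rows in the table, a batch submission might look roughly like the sketch below. It assumes the Python SDK's client.messages.batches.create method and an API key in the environment. The helper names build_batch_requests and submit_batch, and the custom_id scheme, are hypothetical choices for illustration.

```python
def build_batch_requests(prompts, model="claude-sonnet-4-6", max_tokens=512):
    """Turn a list of prompts into Message Batches request entries.

    The custom_id scheme ("item-0", "item-1", ...) is an arbitrary choice for
    this sketch; any unique string per request works.
    """
    return [
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]


def submit_batch(prompts):
    """Submit the batch and return its id. Requires the SDK and an API key."""
    import anthropic  # imported here so the builder above runs without the SDK

    client = anthropic.Anthropic()
    batch = client.messages.batches.create(requests=build_batch_requests(prompts))
    return batch.id
```

After submission, the batch would typically be polled until processing has ended and the results fetched by batch id. The custom_id values are what let you match each result back to its original prompt.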
Reducing Token Usage to Stay Under Limits
Since ITPM and OTPM are counted independently from RPM, a workload can hit a token ceiling well before it hits a request ceiling — especially with long system prompts or large context windows. Two changes shrink token pressure without changing what you're asking Claude to do: prompt caching for repeated static context (system prompts, tool definitions, long documents reused across calls), and trimming context window bloat by summarizing or dropping irrelevant history rather than replaying an entire conversation on every turn.
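Both changes can be sketched in a single request builder. This is a minimal example under stated assumptions: build_request and max_turns are hypothetical names, trimming is done by simply dropping old turns rather than summarizing them, and caching is requested by marking the system prompt with a cache_control block. Caching generally takes effect only above a minimum prompt length, so very short system prompts may not benefit.

```python
def build_request(system_prompt, history, max_turns=6, model="claude-sonnet-4-6"):
    """Build keyword arguments for client.messages.create().

    `max_turns` is a hypothetical cutoff; dropping old turns is the simplest
    form of trimming (summarizing them would be an alternative).
    """
    recent = history[-max_turns:]
    # Conversations are expected to start with a user message
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return {
        "model": model,
        "max_tokens": 1024,
        # Mark the static system prompt as cacheable so repeat calls can reuse it
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": recent,
    }
```

The resulting dictionary can be passed as client.messages.create(**build_request(...)). The usage field on the response reports how many input tokens were read from or written to the cache, which is a quick way to confirm the cache is being hit.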
The Bottom Line
Rate limits are not an edge case to patch around after a production incident — they are a normal part of the API's operating envelope and should be designed for from the start. Read the response headers instead of guessing, respect retry-after on 429s, configure sensible max_retries on the SDK client, and add client-side throttling for any sustained high-volume workload. For genuinely large batch jobs, skip the problem entirely and move to the Batch API, which operates under separate, higher limits built for exactly this use case.