What are Claude's API rate limits based on?

Claude enforces limits along three dimensions at once: RPM (requests per minute), ITPM (input tokens per minute), and OTPM (output tokens per minute). Your exact numbers depend on your account's usage tier, which increases automatically based on spending history. Check current limits in the Anthropic Console under Settings → Limits.

How do I check my remaining Claude API rate limit?

Read the response headers on every API call: anthropic-ratelimit-requests-remaining, anthropic-ratelimit-input-tokens-remaining, and anthropic-ratelimit-output-tokens-remaining. Using client.messages.with_raw_response.create() in Python exposes these headers directly so you can monitor headroom proactively instead of waiting for a 429.

What does a 429 error mean in the Claude API?

A 429 status means you exceeded one of your rate limits (RPM, ITPM, or OTPM). The error body has type 'rate_limit_error' and the response includes a retry-after header specifying exactly how many seconds to wait before retrying.

What is the difference between a 429 and a 529 error from Claude?

A 429 (rate_limit_error) means your own account exceeded its rate limit. A 529 (overloaded_error) means Anthropic's infrastructure is under heavy load independent of your account's limits. Both should be handled with backoff and retry, but a 529 will not be fixed by a rate limit tier increase.

Does the Claude SDK retry automatically on rate limit errors?

Yes. Both the Python and TypeScript SDKs retry 429 and 5xx errors automatically with exponential backoff, defaulting to 2 retries. You can increase this with the max_retries parameter on the client, or override it per request with client.with_options(max_retries=N) in Python or the maxRetries request option in TypeScript.

How do I implement exponential backoff for Claude API rate limits?

Catch anthropic.RateLimitError, read the retry-after header from the response if present, and sleep for that duration (or fall back to 2^attempt seconds with jitter) before retrying, up to a maximum retry count. Prefer the server's retry-after value over a fixed backoff curve since it reflects the actual reset time.

How can I avoid hitting Claude's rate limits in a bulk processing job?

Two options: for moderate volume, add a client-side rate limiter (a throttling loop that keeps your request rate comfortably under your tier's RPM); for large volumes, switch to the Message Batches API. Batches are processed asynchronously under their own limits, separate from synchronous requests, and cost 50% less than synchronous calls.

Are Claude API rate limits per API key or per account?

Rate limits apply per organization, not per individual API key. If multiple keys exist under the same account, they all draw from the same shared RPM/ITPM/OTPM pool.

Can prompt caching help with Claude API rate limits?

Yes, indirectly. ITPM and OTPM are counted independently of RPM, so a workload with large repeated system prompts or documents can hit a token ceiling before a request ceiling. Prompt caching reduces the input tokens counted per call for repeated static content, easing pressure on the token-based limits.

Claude API Rate Limits: Handling 429 Errors (2026)

How Claude's API rate limits work: RPM/ITPM/OTPM, response headers, 429 vs 529 errors, exponential backoff, and client-side throttling.

Every Claude API key operates under rate limits — caps on how many requests and tokens you can send per minute, tied to your account's usage tier. Hit one and you get a 429 response instead of a completion. For a single prototype script, that rarely matters. For a production agent, a batch job, or anything serving concurrent users, understanding these limits — and building around them — is the difference between a system that degrades gracefully and one that falls over during a traffic spike.

This guide covers how Claude's rate limits work, how to read the response headers that tell you exactly where you stand, and the retry and backoff patterns that keep your application running smoothly when you brush up against a ceiling.

How Claude's Rate Limits Work

Anthropic enforces limits along three dimensions simultaneously, and you can be capped by whichever one you hit first:

RPM — requests per minute
ITPM — input tokens per minute
OTPM — output tokens per minute

Your actual numbers depend on your usage tier, which is determined by your account's spending history. New accounts start at the lowest tier with conservative limits, and tiers increase automatically as the account accrues usage and payment history, so in most cases no manual upgrade request is needed. Limits vary by tier and model and change over time, so check your exact current values in the Anthropic Console under Settings → Limits.

Limits apply per organization, not per API key — if you have multiple keys under one account, they all draw from the same pool.
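
To make "whichever one you hit first" concrete, here is a minimal sketch that works out which dimension caps a given workload. The `Limits` and `binding_limit` names and all of the numbers are hypothetical, not real tier values; substitute the figures shown in your Console.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Per-minute limits for one model (hypothetical numbers, not a real tier)."""
    rpm: int
    itpm: int
    otpm: int


def binding_limit(limits, avg_input_tokens, avg_output_tokens):
    """Return (dimension, max_requests_per_minute) for the tightest limit.

    Assumes both average token counts are positive.
    """
    # Each dimension allows a different number of requests per minute.
    candidates = {
        "RPM": limits.rpm,
        "ITPM": limits.itpm // avg_input_tokens,
        "OTPM": limits.otpm // avg_output_tokens,
    }
    dimension = min(candidates, key=candidates.get)
    return dimension, candidates[dimension]


example = Limits(rpm=50, itpm=30_000, otpm=8_000)
print(binding_limit(example, avg_input_tokens=2_000, avg_output_tokens=300))
# → ('ITPM', 15)
```

In this made-up case the account could send 50 requests per minute on paper, but 2,000-token prompts exhaust the input-token budget after 15.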

Reading Rate Limit Headers

Every API response — successful or not — includes headers telling you exactly how much headroom you have left. Check these instead of guessing:

Header	Meaning
`anthropic-ratelimit-requests-limit`	Max requests allowed in the current window
`anthropic-ratelimit-requests-remaining`	Requests left before you hit the cap
`anthropic-ratelimit-requests-reset`	Time when the request limit is fully replenished (RFC 3339 timestamp)
`anthropic-ratelimit-input-tokens-remaining`	Input tokens left in the current window
`anthropic-ratelimit-output-tokens-remaining`	Output tokens left in the current window
`retry-after`	Seconds to wait before retrying (present on 429 responses)

Reading these headers proactively — and slowing down before you hit zero — is far better UX than waiting for a 429 and reacting after the fact.

import anthropic

client = anthropic.Anthropic()

response = client.messages.with_raw_response.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, Claude"}]
)

print("Requests remaining:", response.headers.get("anthropic-ratelimit-requests-remaining"))
print("Input tokens remaining:", response.headers.get("anthropic-ratelimit-input-tokens-remaining"))
print("Output tokens remaining:", response.headers.get("anthropic-ratelimit-output-tokens-remaining"))

message = response.parse()
print(message.content[0].text)
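
Slowing down before you hit zero could look something like the sketch below, which checks the remaining-budget headers against floors you choose. The function name `dimensions_running_low` and the default floor values are assumptions for illustration; it accepts any mapping with a `.get()` method, such as the `response.headers` object above or a plain dict.

```python
def dimensions_running_low(headers, min_requests=5, min_input_tokens=5_000,
                           min_output_tokens=1_000):
    """Return the names of rate limit headers whose remaining value is below its floor.

    The floors are illustrative defaults, not recommended values.
    """
    floors = {
        "anthropic-ratelimit-requests-remaining": min_requests,
        "anthropic-ratelimit-input-tokens-remaining": min_input_tokens,
        "anthropic-ratelimit-output-tokens-remaining": min_output_tokens,
    }
    low = []
    for name, floor in floors.items():
        value = headers.get(name)
        # A missing header gives us nothing to act on, so skip it.
        if value is not None and int(value) < floor:
            low.append(name)
    return low


sample = {
    "anthropic-ratelimit-requests-remaining": "42",
    "anthropic-ratelimit-input-tokens-remaining": "1200",
    "anthropic-ratelimit-output-tokens-remaining": "7000",
}
print(dimensions_running_low(sample))
# → ['anthropic-ratelimit-input-tokens-remaining']
```

A caller might pause briefly, or route new work to a queue, whenever the returned list is non-empty.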

What a 429 Looks Like

When you exceed a limit, the API returns HTTP 429 with an error body identifying which limit you crossed:

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of request tokens has exceeded your per-minute rate limit"
  }
}

The response also includes a retry-after header telling you exactly how many seconds to wait. Respect it — retrying immediately just produces another 429 and wastes a request slot you don't have.

A separate error type, overloaded_error (HTTP 529), means Anthropic's infrastructure is under heavy load rather than your account hitting its own limit. Handle it the same way in code (back off and retry), but note that a rate limit tier increase will not fix it, because the constraint is not on your account.
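
When logging or alerting, it can help to keep the two cases apart. The following is a small sketch that classifies a failed response from its status code and the error body shape shown above; `classify_failure` and its return labels are hypothetical names, not part of any SDK.

```python
def classify_failure(status_code, body):
    """Label a failed response as 'rate_limited', 'overloaded', or 'other'.

    `body` is the parsed JSON error body, or None if it could not be parsed.
    """
    error_type = (body or {}).get("error", {}).get("type")
    if status_code == 429 or error_type == "rate_limit_error":
        return "rate_limited"   # your account's limit: honor retry-after
    if status_code == 529 or error_type == "overloaded_error":
        return "overloaded"     # service-side load: back off and retry
    return "other"


body = {"type": "error", "error": {"type": "rate_limit_error", "message": "..."}}
print(classify_failure(429, body))
# → rate_limited
```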

Exponential Backoff in Python

The official SDK already retries transient errors (including 429 and 529) automatically up to a default of 2 retries with backoff. For custom retry logic — or if you want to log and monitor retries — implement your own backoff loop:

import time
import random
import anthropic

# Disable the SDK's built-in retries so they don't stack on top of this loop
client = anthropic.Anthropic(max_retries=0)

def call_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages
            )
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            retry_after = e.response.headers.get("retry-after")
            wait = float(retry_after) if retry_after else (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except anthropic.APIStatusError as e:
            if e.status_code == 529 and attempt < max_retries - 1:
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
            else:
                raise

response = call_with_backoff([{"role": "user", "content": "Summarize this quarter's revenue."}])
print(response.content[0].text)

Preferring the server's retry-after value over a fixed exponential curve matters — Anthropic knows exactly when your window resets, and using that number avoids both under-waiting (another 429) and over-waiting (idle time you didn't need).
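
The wait calculation can also be pulled out into a helper so it is easy to test in isolation. This is a sketch under a few assumptions: `compute_wait` is a hypothetical name, the 60-second cap applies only to the fallback curve, and any retry-after value that is not a plain number of seconds is ignored in favor of the fallback.

```python
import random


def compute_wait(retry_after, attempt, cap=60.0, jitter=random.random):
    """Seconds to sleep before the next try; `attempt` is 0-based.

    Prefers the server's retry-after value and falls back to capped
    exponential backoff plus jitter in [0, 1).
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a plain number of seconds: use the fallback curve
    return min(cap, 2 ** attempt + jitter())


print(compute_wait("7", attempt=3))
# → 7.0
```

Passing `jitter` as a parameter keeps the function deterministic in tests while still randomizing waits in production.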

Configuring SDK Retries Directly

Both official SDKs let you configure retry behavior at the client level instead of writing your own loop:

import anthropic

# Python: set max_retries on the client
client = anthropic.Anthropic(max_retries=5)

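# Or override per-request: with_options() returns a copy of the client
# with the new setting, so it goes before .messages.create()
response = client.with_options(max_retries=2).messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)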

// TypeScript: same pattern
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ maxRetries: 5 });

const message = await client.messages.create(
  {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Hello" }]
  },
  { maxRetries: 2 }
);

The SDK's built-in retry already applies exponential backoff with jitter to 429 and 5xx responses, so for most applications, bumping max_retries is enough — you don't need custom retry code unless you want fine-grained logging or per-request retry-after handling.
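
Before raising max_retries, it is worth estimating how much extra latency a fully exhausted retry sequence could add. The arithmetic below is a rough sketch: the `base` and `cap` values are illustrative assumptions rather than the SDK's documented constants, and the estimate ignores jitter, request time, and any retry-after value.

```python
def worst_case_backoff(max_retries, base=0.5, cap=8.0):
    """Upper bound on total sleep time across `max_retries` retries.

    Assumes the delay doubles each retry from `base` seconds, capped at `cap`.
    """
    return sum(min(cap, base * 2 ** i) for i in range(max_retries))


print(worst_case_backoff(2))
# → 1.5
print(worst_case_backoff(5))
# → 15.5
```

Under these assumptions, going from 2 to 5 retries changes the worst case from under two seconds to roughly a quarter of a minute, which matters for interactive requests with tight timeouts.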

Client-Side Rate Limiting (Staying Under the Ceiling)

Reactive retries handle occasional spikes, but for sustained high-throughput workloads such as bulk classification or large document pipelines, it is more efficient to throttle your own request rate so you rarely hit a 429 at all. A simple sliding-window limiter around your request loop does this by remembering the timestamps of the last minute's requests:

import time
from collections import deque

import anthropic

client = anthropic.Anthropic()

class RateLimiter:
    """Sliding-window limiter: at most N requests in any 60-second span."""

    def __init__(self, max_requests_per_minute):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()  # timestamps of recent requests, oldest first

    def _drop_expired(self):
        cutoff = time.monotonic() - 60
        while self.request_times and self.request_times[0] <= cutoff:
            self.request_times.popleft()

    def wait_if_needed(self):
        self._drop_expired()
        while len(self.request_times) >= self.max_requests:
            # Window is full: sleep until the oldest request ages out, then re-check
            time.sleep(max(0.0, self.request_times[0] + 60 - time.monotonic()))
            self._drop_expired()
        self.request_times.append(time.monotonic())

limiter = RateLimiter(max_requests_per_minute=50)

prompts = ["First document to classify", "Second document to classify"]  # placeholder inputs

for prompt in prompts:
    limiter.wait_if_needed()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )

Set max_requests_per_minute comfortably below your actual tier limit (leave headroom for other traffic on the same account) rather than trying to max it out exactly.
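
The limiter above counts requests only, so a job with large prompts can still run into ITPM. One way to cover that dimension is a second sliding window over estimated input tokens, sketched below. The `TokenBudget` class is hypothetical, the caller is assumed to supply its own token estimate for each request, and the clock and sleep functions are injectable purely so the logic can be tested without waiting.

```python
import time
from collections import deque


class TokenBudget:
    """Sliding-window cap on estimated tokens sent in any 60-second span."""

    def __init__(self, max_tokens_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.max_tokens = max_tokens_per_minute
        self.clock = clock
        self.sleep = sleep
        self.events = deque()  # (timestamp, tokens) pairs, oldest first
        self.used = 0          # tokens currently inside the window

    def _expire(self):
        cutoff = self.clock() - 60
        while self.events and self.events[0][0] <= cutoff:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def acquire(self, tokens):
        """Block until `tokens` fit in the window, then record them."""
        if tokens > self.max_tokens:
            raise ValueError("request is larger than the whole per-minute budget")
        self._expire()
        while self.used + tokens > self.max_tokens:
            # Sleep until the oldest entry leaves the window, then re-check.
            wait = self.events[0][0] + 60 - self.clock()
            self.sleep(max(wait, 0.0))
            self._expire()
        self.events.append((self.clock(), tokens))
        self.used += tokens
```

Calling `budget.acquire(estimated_tokens)` right after `limiter.wait_if_needed()` would gate each request on both dimensions. As with RPM, set the budget below your tier's actual ITPM to leave headroom.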

When to Use the Batch API Instead

If you are processing a large volume of independent requests and hitting rate limits from sequential synchronous calls, the better fix is often not a smarter retry loop but a switch to the Message Batches API. Batches are processed asynchronously under their own limits, separate from synchronous requests, cost 50% less, and largely remove the need for client-side throttling because Anthropic manages the pacing. The trade-off is latency: results become available when the batch finishes rather than per request, so this suits work that does not need an immediate answer.

Situation	Best Approach
Interactive chat, single user	Synchronous requests, SDK default retries are enough
Concurrent multi-user production app	Client-side rate limiter + exponential backoff on 429
Large one-off bulk job (thousands of items)	Batch API — sidesteps synchronous rate limits entirely
Nightly scheduled processing	Batch API or a throttled loop run overnight
Agent loop with tool calls	Backoff with retry-after respected; consider prompt caching to cut token volume per call
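
For the bulk-job rows above, the main preparation step is turning your inputs into batch request entries, each with a `custom_id` and the same `params` you would pass to a synchronous call. The sketch below only builds that list; `build_batch_requests` and the `item-N` ID scheme are assumptions for illustration.

```python
def build_batch_requests(prompts, model="claude-sonnet-4-6", max_tokens=512):
    """Turn a list of prompt strings into Message Batches request entries."""
    return [
        {
            # custom_id lets you match each result back to its input later
            "custom_id": f"item-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]


requests = build_batch_requests(["Classify: great product", "Classify: arrived broken"])
print(len(requests), requests[0]["custom_id"])
# → 2 item-0
```

With the Python SDK, a list like this would be submitted via `client.messages.batches.create(requests=requests)`, after which you poll the batch until processing ends and then fetch the results.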

Reducing Token Usage to Stay Under Limits

Since ITPM and OTPM are counted independently from RPM, a workload can hit a token ceiling well before it hits a request ceiling — especially with long system prompts or large context windows. Two changes shrink token pressure without changing what you're asking Claude to do: prompt caching for repeated static context (system prompts, tool definitions, long documents reused across calls), and trimming context window bloat by summarizing or dropping irrelevant history rather than replaying an entire conversation on every turn.
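
Trimming history can be as simple as keeping the most recent turns that fit a budget. The sketch below makes several assumptions: message content is a plain string, token counts are approximated at roughly four characters per token (a crude heuristic, not a real tokenizer), and `estimate_tokens` and `trim_history` are hypothetical names.

```python
def estimate_tokens(text):
    """Very rough token estimate (assumes about 4 characters per token)."""
    return max(1, len(text) // 4)


def trim_history(messages, max_input_tokens):
    """Keep the most recent messages whose estimated size fits the budget."""
    kept = []
    total = 0
    # Walk backwards from the newest message so recent context wins.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if total + cost > max_input_tokens:
            break
        kept.append(message)
        total += cost
    kept.reverse()
    # The conversation should start with a user turn, so drop any
    # leading assistant messages left over after trimming.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

Summarizing the dropped turns into a short note, rather than discarding them outright, is a natural extension when older context still matters.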

The Bottom Line

Rate limits are not an edge case to patch around after a production incident — they are a normal part of the API's operating envelope and should be designed for from the start. Read the response headers instead of guessing, respect retry-after on 429s, configure sensible max_retries on the SDK client, and add client-side throttling for any sustained high-volume workload. For genuinely large batch jobs, skip the problem entirely and move to the Batch API, which operates under separate, higher limits built for exactly this use case.