
Cost Optimization at Scale.

Caching, model routing, and token budgets that save thousands monthly.

After this lesson you'll know

  • How to audit and attribute AI costs across your system
  • Token budget strategies that prevent runaway spending
  • Model routing economics: when cheap models beat expensive ones
  • Prompt engineering techniques that reduce costs 30-50% without quality loss

The Cost Iceberg

Most teams know their monthly API bill. Few know where the money actually goes. Cost optimization starts with attribution -- understanding exactly which features, users, and requests drive spending.

```python
from datetime import datetime, timezone

class CostTracker:
    def log_call(self, model, input_tokens, output_tokens, metadata):
        cost = self.calculate_cost(model, input_tokens, output_tokens)
        self.store({
            "cost_usd": cost,
            "model": model,
            "feature": metadata.get("feature"),
            "user_tier": metadata.get("user_tier"),
            "cache_miss": metadata.get("cache_miss", True),
            "timestamp": datetime.now(timezone.utc),
        })

    def calculate_cost(self, model, input_tok, output_tok):
        # Per-MTok rates in USD: {"input": ..., "output": ...}
        rates = {
            "claude-opus-4-20250514": {"input": 15.0, "output": 75.0},
            "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
            "claude-haiku": {"input": 0.25, "output": 1.25},
            "gpt-4o": {"input": 2.5, "output": 10.0},
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        }
        r = rates[model]
        return (input_tok * r["input"] + output_tok * r["output"]) / 1_000_000
```

Once you have attribution, patterns emerge. Common findings:

  • 20% of features drive 80% of cost
  • Free-tier users on expensive models burn money
  • Retry storms on failed calls double bills silently
  • Long system prompts duplicated across every call waste input tokens
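With records like the ones `CostTracker.store` writes, attribution is a group-by. A minimal sketch over in-memory dicts -- a real system would query a warehouse or metrics store:

```python
from collections import defaultdict

def cost_by_feature(records):
    """Sum cost_usd per feature so the 80/20 pattern becomes visible."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["feature"]] += rec["cost_usd"]
    # Most expensive feature first
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

records = [
    {"feature": "chat", "cost_usd": 0.03},
    {"feature": "chat", "cost_usd": 0.05},
    {"feature": "search", "cost_usd": 0.01},
]
cost_by_feature(records)  # chat first, then search
```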
Real numbers: Claude Opus at $15/$75 per MTok vs. Haiku at $0.25/$1.25 per MTok means Opus is 60x more expensive on input. A 2,000-token system prompt sent with every request costs $0.03/call on Opus vs. $0.0005/call on Haiku. At 10K calls/day, that's $300/day vs. $5/day -- just for the system prompt.
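The arithmetic is easy to sanity-check in a few lines, using the rates from the table above:

```python
def system_prompt_cost_per_day(prompt_tokens, input_rate_per_mtok, calls_per_day):
    """Daily input-token cost of resending the same system prompt on every call."""
    return prompt_tokens * input_rate_per_mtok * calls_per_day / 1_000_000

system_prompt_cost_per_day(2_000, 15.0, 10_000)  # Opus: 300.0 USD/day
system_prompt_cost_per_day(2_000, 0.25, 10_000)  # Haiku: 5.0 USD/day
```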

Token Budget Strategies

Token budgets set hard limits on how many tokens a request can consume, preventing runaway costs from long inputs, verbose outputs, or unbounded agent loops.

```python
class TokenBudget:
    def __init__(self, max_input=4000, max_output=2000, max_total=8000):
        self.max_input = max_input
        self.max_output = max_output
        self.max_total = max_total
        self.spent = 0

    def can_afford(self, estimated_tokens):
        # Check before the call is made, not after the bill arrives
        return self.spent + estimated_tokens <= self.max_total

    def truncate_context(self, documents, budget):
        """Fit documents within budget, prioritizing by relevance."""
        selected = []
        remaining = budget
        # Greedy fill: highest-scoring documents first
        for doc in sorted(documents, key=lambda d: d.score, reverse=True):
            if doc.token_count <= remaining:
                selected.append(doc)
                remaining -= doc.token_count
        return selected
```

Three levels of budget enforcement:

1. **Request-level**: Cap tokens per individual API call. Prevents single-call blowups.
2. **Session-level**: Cap total tokens for an entire user session or agent run. Prevents infinite loops.
3. **User-level**: Daily or monthly caps per user or API key. Prevents abuse and enables tiered pricing.
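Session-level enforcement is the one that stops unbounded agent loops. A minimal self-contained sketch -- the class and loop here are illustrative, not the lesson's `TokenBudget`:

```python
class SessionBudget:
    """Minimal session-level token budget (illustrative sketch)."""
    def __init__(self, max_total):
        self.max_total = max_total
        self.spent = 0

    def charge(self, tokens):
        """Record spend and return True if affordable, else False."""
        if self.spent + tokens > self.max_total:
            return False
        self.spent += tokens
        return True

def run_agent_steps(step_costs, budget):
    """Simulate an agent loop that halts cleanly when the budget runs out."""
    completed = 0
    for cost in step_costs:
        if not budget.charge(cost):
            break  # stop the loop instead of letting it run forever
        completed += 1
    return completed

b = SessionBudget(max_total=8_000)
run_agent_steps([3_000, 3_000, 3_000], b)  # completes 2 steps; the third would exceed
```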
Output token trick: Setting a lower `max_tokens` on the API call doesn't just save money -- it forces the model to be concise. For many tasks, 500 output tokens produces a better response than 4,000 because the model prioritizes essential information.