
Cost Optimization at Scale.

Caching, model routing, and token budgets that save thousands monthly.

After this lesson you'll know

  • How to audit and attribute AI costs across your system
  • Token budget strategies that prevent runaway spending
  • Model routing economics: when cheap models beat expensive ones
  • Prompt engineering techniques that reduce costs 30-50% without quality loss

The Cost Iceberg

Most teams know their monthly API bill. Few know where the money actually goes. Cost optimization starts with attribution -- understanding exactly which features, users, and requests drive spending.

```python
from datetime import datetime, timezone

class CostTracker:
    def log_call(self, model, input_tokens, output_tokens, metadata):
        cost = self.calculate_cost(model, input_tokens, output_tokens)
        self.store({
            "cost_usd": cost,
            "model": model,
            "feature": metadata.get("feature"),
            "user_tier": metadata.get("user_tier"),
            "cache_miss": metadata.get("cache_miss", True),
            "timestamp": datetime.now(timezone.utc),
        })

    def calculate_cost(self, model, input_tok, output_tok):
        # Per-MTok rates in USD: {"input": ..., "output": ...}
        rates = {
            "claude-opus-4-20250514": {"input": 15.0, "output": 75.0},
            "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
            "claude-haiku": {"input": 0.25, "output": 1.25},
            "gpt-4o": {"input": 2.5, "output": 10.0},
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        }
        r = rates[model]
        return (input_tok * r["input"] + output_tok * r["output"]) / 1_000_000
```

Once you have attribution, patterns emerge. Common findings:

  • 20% of features drive 80% of cost
  • Free-tier users on expensive models burn money
  • Retry storms on failed calls double bills silently
  • Long system prompts duplicated across every call waste input tokens
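With records like the ones `CostTracker.store` writes, attribution is a group-by. A minimal sketch over in-memory dicts -- a real system would query a warehouse or metrics store:

```python
from collections import defaultdict

def cost_by_feature(records):
    """Sum cost_usd per feature so the 80/20 pattern becomes visible."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["feature"]] += rec["cost_usd"]
    # Most expensive feature first
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

records = [
    {"feature": "chat", "cost_usd": 0.03},
    {"feature": "chat", "cost_usd": 0.05},
    {"feature": "search", "cost_usd": 0.01},
]
cost_by_feature(records)  # chat first, then search
```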
Real numbers: Claude Opus at $15/$75 per MTok vs. Haiku at $0.25/$1.25 per MTok means Opus is 60x more expensive on input. A 2,000-token system prompt sent with every request costs $0.03/call on Opus vs. $0.0005/call on Haiku. At 10K calls/day, that's $300/day vs. $5/day -- just for the system prompt.
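The arithmetic is easy to sanity-check in a few lines, using the rates from the table above:

```python
def system_prompt_cost_per_day(prompt_tokens, input_rate_per_mtok, calls_per_day):
    """Daily input-token cost of resending the same system prompt on every call."""
    return prompt_tokens * input_rate_per_mtok * calls_per_day / 1_000_000

system_prompt_cost_per_day(2_000, 15.0, 10_000)  # Opus: 300.0 USD/day
system_prompt_cost_per_day(2_000, 0.25, 10_000)  # Haiku: 5.0 USD/day
```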

Token Budget Strategies

Token budgets set hard limits on how many tokens a request can consume, preventing runaway costs from long inputs, verbose outputs, or unbounded agent loops.

```python
class TokenBudget:
    def __init__(self, max_input=4000, max_output=2000, max_total=8000):
        self.max_input = max_input
        self.max_output = max_output
        self.max_total = max_total
        self.spent = 0

    def can_afford(self, estimated_tokens):
        # Check before the call is made, not after the bill arrives
        return self.spent + estimated_tokens <= self.max_total

    def truncate_context(self, documents, budget):
        """Fit documents within budget, prioritizing by relevance."""
        selected = []
        remaining = budget
        # Greedy fill: highest-scoring documents first
        for doc in sorted(documents, key=lambda d: d.score, reverse=True):
            if doc.token_count <= remaining:
                selected.append(doc)
                remaining -= doc.token_count
        return selected
```

Three levels of budget enforcement:

1. **Request-level**: Cap tokens per individual API call. Prevents single-call blowups.
2. **Session-level**: Cap total tokens for an entire user session or agent run. Prevents infinite loops.
3. **User-level**: Daily or monthly caps per user or API key. Prevents abuse and enables tiered pricing.
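Session-level enforcement is the one that stops unbounded agent loops. A minimal self-contained sketch -- the class and loop here are illustrative, not the lesson's `TokenBudget`:

```python
class SessionBudget:
    """Minimal session-level token budget (illustrative sketch)."""
    def __init__(self, max_total):
        self.max_total = max_total
        self.spent = 0

    def charge(self, tokens):
        """Record spend and return True if affordable, else False."""
        if self.spent + tokens > self.max_total:
            return False
        self.spent += tokens
        return True

def run_agent_steps(step_costs, budget):
    """Simulate an agent loop that halts cleanly when the budget runs out."""
    completed = 0
    for cost in step_costs:
        if not budget.charge(cost):
            break  # stop the loop instead of letting it run forever
        completed += 1
    return completed

b = SessionBudget(max_total=8_000)
run_agent_steps([3_000, 3_000, 3_000], b)  # completes 2 steps; the third would exceed
```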
Output token trick: Setting a lower `max_tokens` on the API call doesn't just save money -- it forces the model to be concise. For many tasks, 500 output tokens produces a better response than 4,000 because the model prioritizes essential information.