Managing AI Compute & API Costs
AI infrastructure can be cheap or catastrophically expensive. The difference is strategy. Every technique in this lesson exists because someone learned the hard way that AI costs don't behave like traditional hosting costs.
What you'll learn
- The real cost breakdown of AI API calls
- Caching strategies that cut costs by 40-70%
- Model selection: when cheaper models are actually better
- Building a tiered architecture that minimizes expensive API calls
Understanding AI Cost Structure
AI API pricing is based on tokens — chunks of text roughly equivalent to 3/4 of a word. You pay for both input tokens (what you send) and output tokens (what the model generates). Output tokens typically cost 3-5x more than input tokens.
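To make that concrete, here is the basic arithmetic in TypeScript. The per-million-token prices are illustrative placeholders, not any provider's real rates; check the current pricing page before trusting numbers like these.

```ts
// Per-million-token prices in USD. Illustrative placeholders only;
// note the output price is set at 5x the input price.
const PRICE_PER_MTOK = { input: 3.0, output: 15.0 };

function requestCostUSD(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * PRICE_PER_MTOK.input +
    (outputTokens / 1_000_000) * PRICE_PER_MTOK.output
  );
}

// A turn with a 10,000-token context and a 1,000-token reply:
console.log(requestCostUSD(10_000, 1_000)); // 0.045
```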
A single conversation turn with a large context window can cost $0.05-$0.50. Multiply that by thousands of users and dozens of interactions per user, and you're looking at real money. The organizations that survive are the ones that optimize ruthlessly.
Your system prompt alone might be 2,000 tokens. If that prompt goes with every request, you're paying for it every single time. This is the first place to optimize — make your system prompts as concise as possible.
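Here is what that overhead looks like at scale, using the same hypothetical input price as above:

```ts
// Monthly cost of shipping a static 2,000-token system prompt with
// every request. The input price is the same placeholder as above.
const INPUT_PRICE_PER_MTOK = 3.0; // USD, hypothetical
const systemPromptTokens = 2_000;
const requestsPerMonth = 1_000_000;

const monthlyOverheadUSD =
  (systemPromptTokens / 1_000_000) * INPUT_PRICE_PER_MTOK * requestsPerMonth;
console.log(monthlyOverheadUSD); // 6000: the prompt alone costs $6k/month
```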
Caching: Stop Paying Twice
Semantic caching: Before sending a query to an LLM, check if a sufficiently similar query has been answered recently. Use vector similarity to find near-matches. If someone asked "how do I deploy to Vercel?" five minutes ago, the answer to "deploying on Vercel?" is probably the same.
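A minimal in-memory sketch of the idea. The embedding function is injected (use whatever embedding API you already call), and the 0.92 similarity threshold and five-minute TTL are assumptions to tune against real traffic:

```ts
type Embed = (text: string) => Promise<number[]>;

// Minimal in-memory semantic cache. Production versions would use a
// vector database, but the control flow is the same.
class SemanticCache {
  private entries: { vector: number[]; answer: string; createdAt: number }[] = [];

  constructor(
    private embed: Embed,
    private threshold = 0.92,  // assumed; tune on real queries
    private ttlMs = 5 * 60_000 // only trust recent answers
  ) {}

  private static cosine(a: number[], b: number[]): number {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Returns a cached answer if a recent query was similar enough.
  async lookup(query: string): Promise<string | null> {
    const v = await this.embed(query);
    const now = Date.now();
    let best: string | null = null;
    let bestScore = this.threshold;
    for (const e of this.entries) {
      if (now - e.createdAt > this.ttlMs) continue;
      const score = SemanticCache.cosine(v, e.vector);
      if (score >= bestScore) {
        bestScore = score;
        best = e.answer;
      }
    }
    return best;
  }

  async store(query: string, answer: string): Promise<void> {
    this.entries.push({ vector: await this.embed(query), answer, createdAt: Date.now() });
  }
}
```

On a hit you return the stored answer and skip the LLM entirely; on a miss you call the model, then `store` the new pair for the next near-duplicate.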
Response caching: For deterministic operations (embeddings, classifications, structured data extraction), cache the result keyed on the input hash. Embeddings for the same text never change — compute them once and store them forever.
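A sketch of exact-match response caching, keyed on a hash of the operation, model, and input so a model upgrade never serves stale results. The in-process Map stands in for whatever store you actually use (Redis, a database table, etc.):

```ts
import { createHash } from "node:crypto";

// In production this Map would be Redis or a database table;
// the keying scheme is the part that matters.
const responseCache = new Map<string, unknown>();

function cacheKey(op: string, model: string, input: string): string {
  return createHash("sha256").update(`${op}:${model}:${input}`).digest("hex");
}

// Wrap any deterministic call: return the stored result on a hit,
// compute and store it on a miss.
async function cached<T>(
  op: string,
  model: string,
  input: string,
  compute: () => Promise<T>
): Promise<T> {
  const key = cacheKey(op, model, input);
  if (responseCache.has(key)) return responseCache.get(key) as T;
  const value = await compute();
  responseCache.set(key, value);
  return value;
}

// Usage (embedText is a placeholder for your embedding call):
// const vec = await cached("embed", "bge-small", text, () => embedText(text));
```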
Prompt caching: Some providers (including Anthropic) offer prompt caching: if you send the same large system prompt repeatedly, you pay a one-time cache write (slightly above the normal input price) and then a small fraction of the input price on every subsequent hit. Structure your requests so the static parts come first to take advantage of this.
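With the Anthropic SDK this means marking the static prefix of your request with `cache_control`. A sketch; the model name and system prompt below are placeholders:

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const LONG_SYSTEM_PROMPT = "..."; // your ~2,000-token static instructions

// Marking the system prompt as cacheable means repeat requests read it
// from the provider-side cache at a reduced input price instead of
// paying for those tokens in full every time.
const response = await anthropic.messages.create({
  model: "claude-3-5-haiku-latest", // placeholder; use your actual model
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "How do I deploy to Vercel?" }],
});
```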
A well-implemented caching layer typically reduces AI API costs by 40-70%. That's not an optimization — it's a survival strategy.
Model Selection Strategy
Not every task needs the most powerful model. Classification, extraction, and simple Q&A can often be handled by smaller, cheaper models. Reserve the expensive models for tasks that actually need their capabilities.
Tiered approach: Use free or cheap embeddings (HuggingFace BGE-small) for semantic search. Use a mid-tier model (Claude Haiku, GPT-4o-mini) for simple tasks. Reserve the flagship model (Claude Opus, GPT-4o) for complex reasoning that genuinely needs it.
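In code, the tiering can be as simple as a routing function. The task taxonomy and model IDs below are illustrative assumptions; substitute whatever you actually run:

```ts
type Task = "classification" | "extraction" | "simple-qa" | "complex-reasoning";

// Illustrative tier-to-model mapping; swap in the model IDs you use.
const MODELS = {
  cheap: "claude-3-5-haiku-latest", // or GPT-4o-mini
  flagship: "claude-3-opus-latest", // or GPT-4o
} as const;

// Route each task to the cheapest model that handles it well;
// only complex reasoning earns the flagship.
function pickModel(task: Task): string {
  return task === "complex-reasoning" ? MODELS.flagship : MODELS.cheap;
}
```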
RAG before generation: Before asking an LLM to generate an answer, check if the answer already exists in your knowledge base. A vector search costs fractions of a cent; a generation call can easily cost 100x that. Let your database do the cheap work first.
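A sketch of that ordering. The search and generation functions are injected placeholders, and the 0.85 score cutoff is an assumption to calibrate:

```ts
type Hit = { text: string; score: number };
type Search = (query: string, topK: number) => Promise<Hit[]>;
type Generate = (query: string, context: string[]) => Promise<string>;

async function answer(
  query: string,
  searchKb: Search,  // vector search: fractions of a cent
  callLLM: Generate, // generation: the expensive path
  minScore = 0.85    // assumed cutoff; calibrate on your data
): Promise<string> {
  const hits = await searchKb(query, 3);

  // If stored knowledge answers confidently, skip the LLM entirely.
  if (hits.length > 0 && hits[0].score >= minScore) {
    return hits[0].text;
  }

  // Otherwise pay for generation, grounded in whatever context was found.
  return callLLM(query, hits.map((h) => h.text));
}
```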
This tiered architecture is how Like One works: free embeddings handle similarity search, and the expensive model only gets called when the brain can't answer from stored knowledge.