Caching & Latency Optimization.

Semantic caching, KV cache, and edge strategies for sub-second AI responses.

After this lesson you'll know

  • How semantic caching differs from exact-match caching and when to use each
  • KV cache mechanics and how to exploit them for faster inference
  • Edge caching strategies for globally distributed AI applications
  • Cache invalidation strategies for AI-generated content

Why Caching Changes Everything for AI

A model call costs time and money. A cache hit costs neither. For AI systems, caching is the single most impactful optimization because model calls are both slow (1-30 seconds) and expensive ($0.001-$0.50 per call). The challenge is that traditional exact-match caching rarely works for AI. Users ask the same question in different ways: "What's the refund policy?" and "How do I get my money back?" are semantically identical but textually different. This is where semantic caching enters.

Three caching layers for AI systems:

1. **Exact-match cache**: Hash the prompt, return the cached response for identical inputs. Simple, fast, and effective for programmatic queries (APIs, structured inputs) -- see the sketch after this list.
2. **Semantic cache**: Embed the query, find similar cached queries above a similarity threshold, return the cached response. Handles natural language variation.
3. **KV cache / Prompt cache**: Reuse the model's internal key-value cache for shared prompt prefixes. Reduces latency and cost for requests with common system prompts.
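To make the first layer concrete, here is a minimal exact-match cache sketch: it keys on a hash of the full prompt plus the generation parameters, so any byte-level difference is a miss. The in-memory dict and the names (`ExactMatchCache`, `_key`) are illustrative, not part of the lesson.

```python
import hashlib
import json
import time


class ExactMatchCache:
    """Layer 1: exact-match cache keyed on a hash of prompt + params."""

    def __init__(self, ttl=3600):
        self.ttl = ttl
        self.store = {}  # key -> (response, created_at); in production, Redis or similar

    def _key(self, prompt, **params):
        # Any difference in prompt text or sampling params produces a new key.
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, prompt, **params):
        entry = self.store.get(self._key(prompt, **params))
        if entry is None:
            return None
        response, created_at = entry
        if time.time() - created_at > self.ttl:
            return None  # expired
        return response

    def set(self, prompt, response, **params):
        self.store[self._key(prompt, **params)] = (response, time.time())
```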
Impact at scale: A well-tuned semantic cache typically achieves 30-50% hit rates on customer support queries, reducing both average latency (from 3s to 50ms) and costs proportionally. For FAQ-heavy applications, hit rates can exceed 70%.
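As a back-of-the-envelope check on those numbers (the 40% hit rate below is illustrative, sitting mid-range of the 30-50% figure above):

```python
hit_rate = 0.40          # assumed hit rate
hit_latency = 0.05       # ~50 ms for a cache hit
miss_latency = 3.0       # ~3 s for a full model call

expected_latency = hit_rate * hit_latency + (1 - hit_rate) * miss_latency
print(expected_latency)  # 1.82 s average, roughly 40% below the 3 s baseline
```

Cost drops by about the same fraction, since a cache hit skips the model call entirely.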

Building a Semantic Cache

A semantic cache stores query embeddings alongside their responses. When a new query arrives, it's embedded and compared against cached entries using cosine similarity.

```python
import time
from dataclasses import dataclass

import numpy as np


@dataclass
class CacheHit:
    response: str
    similarity: float
    original_query: str


class SemanticCache:
    def __init__(self, embedding_model, threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = threshold
        self.entries = []  # In production, use a vector DB

    async def get(self, query):
        query_embedding = await self.embedding_model.embed(query)
        best_match = None
        best_score = 0.0
        now = time.time()
        for entry in self.entries:
            if now - entry["created_at"] > entry["ttl"]:
                continue  # skip expired entries
            score = cosine_similarity(query_embedding, entry["embedding"])
            if score > best_score and score >= self.threshold:
                best_score = score
                best_match = entry
        if best_match:
            return CacheHit(
                response=best_match["response"],
                similarity=best_score,
                original_query=best_match["query"],
            )
        return None

    async def set(self, query, response, ttl=3600):
        embedding = await self.embedding_model.embed(query)
        self.entries.append({
            "query": query,
            "embedding": embedding,
            "response": response,
            "created_at": time.time(),
            "ttl": ttl,
        })


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

The threshold is critical. Too low (0.80) and you serve wrong answers for different questions. Too high (0.98) and the cache rarely hits. Start at 0.92 and tune based on your domain -- narrow domains (technical docs) can go lower, broad domains (general chat) need higher thresholds.
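A typical read-through pattern around a model call might look like the sketch below; `embedder` and `call_model` are placeholders for your own embedding client and LLM call, not names from the lesson.

```python
cache = SemanticCache(embedding_model=embedder, threshold=0.92)

async def answer(query):
    hit = await cache.get(query)
    if hit is not None:
        # Served from cache: tens of milliseconds, no model cost.
        return hit.response

    # Cache miss: pay for the model call once, then store the result.
    response = await call_model(query)
    await cache.set(query, response, ttl=3600)
    return response
```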
Embedding model choice: Use a fast, cheap embedding model for cache lookups (like text-embedding-3-small). The embedding call itself must cost far less than the model call a hit saves. At ~10ms per embedding vs. ~3s per model call, this pays for itself immediately.
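One way to wire that up is a thin `embed()` adapter around text-embedding-3-small. The sketch below assumes the OpenAI Python SDK (v1+) with an `OPENAI_API_KEY` in the environment; the class name and defaults are illustrative.

```python
import numpy as np
from openai import AsyncOpenAI


class OpenAIEmbedder:
    """Adapter so SemanticCache can call `await embedder.embed(text)`."""

    def __init__(self, model="text-embedding-3-small"):
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    async def embed(self, text):
        resp = await self.client.embeddings.create(model=self.model, input=text)
        # Return a NumPy array so cosine_similarity works directly.
        return np.array(resp.data[0].embedding, dtype=np.float32)
```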