
Scaling Strategies.

Horizontal, vertical, and queue-based scaling for AI workloads.

After this lesson you'll know

  • Why AI workloads scale differently than traditional web services
  • Horizontal scaling patterns: load balancing, stateless design, provider pooling
  • Queue-based architectures for long-running AI tasks
  • Auto-scaling strategies that balance cost and performance

AI Scaling is Different

Traditional web scaling assumes requests are fast (sub-100ms), stateless, and cheap. AI workloads break those assumptions:

- **Slow**: A single model call can take 5-30 seconds. Streaming helps UX but doesn't reduce compute time.
- **Expensive**: Each request consumes real resources (tokens, GPU time) that cost money. You can't just add servers to make costs disappear.
- **Bursty**: AI usage patterns spike unpredictably. A viral tweet mentioning your product can 10x traffic in an hour.
- **Rate-limited**: Third-party APIs impose per-minute and per-day limits that become your bottleneck.

These characteristics mean standard auto-scaling rules (scale when CPU > 70%) don't work. You need AI-aware scaling strategies.
The rate limit wall: OpenAI's Tier 1 limits are ~500 RPM for GPT-4. If you have 100 concurrent users each making 5 requests per minute, you've already hit your limit. Scaling your servers does nothing -- the bottleneck is the API provider, not your infrastructure.
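When the provider is the bottleneck, the scaling signal has to track the queue and the provider budget, not CPU. Below is a minimal sketch of one such signal; the function name, thresholds, and the 500 RPM default are illustrative assumptions, not a prescribed implementation:

```python
import math

def desired_workers(
    queue_depth: int,               # pending AI tasks in the backlog
    avg_task_seconds: float,        # observed mean model-call latency (> 0)
    target_wait_seconds: float = 30.0,  # SLO: max time a task should sit queued
    pool_rpm_limit: int = 500,      # aggregate RPM across your providers (assumed)
    min_workers: int = 1,
    max_workers: int = 50,
) -> int:
    """Scale on queue pressure instead of CPU: return the worker count
    needed to drain the current backlog within the wait-time SLO."""
    # Workers needed so queue_depth tasks all finish within the SLO window.
    needed = math.ceil(queue_depth * avg_task_seconds / target_wait_seconds)

    # Each worker completes roughly 60 / avg_task_seconds calls per minute,
    # so running more workers than the provider RPM can feed just idles them.
    rpm_cap = math.floor(pool_rpm_limit * avg_task_seconds / 60)

    return max(min_workers, min(needed, rpm_cap, max_workers))
```

Run something like this on a timer against live queue metrics and feed the result to whatever actually resizes your fleet. The RPM cap is the AI-specific part: past it, extra workers burn money without moving the backlog.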

Horizontal Scaling: Provider Pooling

When a single API provider can't handle your load, distribute requests across multiple providers. This is provider pooling -- the AI equivalent of database read replicas.

```python
import random

def weighted_choice(providers):
    """Pick a provider with probability proportional to its routing weight."""
    weights = [p.weight for p in providers]
    return random.choices(providers, weights=weights, k=1)[0]

class ProviderPool:
    def __init__(self):
        # Provider, RateLimitError, and queue_for_later are assumed to be
        # defined elsewhere. Weights control how traffic is split.
        self.providers = [
            Provider("anthropic", rpm_limit=1000, weight=0.5),
            Provider("openai", rpm_limit=500, weight=0.3),
            Provider("local_llama", rpm_limit=9999, weight=0.2),
        ]

    async def call(self, prompt, requirements):
        # Keep providers with spare capacity (under 90% of their RPM
        # limit) that can satisfy this request's requirements.
        available = [
            p for p in self.providers
            if p.current_rpm < p.rpm_limit * 0.9
            and p.supports(requirements)
        ]
        if not available:
            return await self.queue_for_later(prompt, requirements)

        # Weighted random selection among available providers
        provider = weighted_choice(available)
        try:
            return await provider.call(prompt)
        except RateLimitError:
            # mark_limited() must push the provider over the 90% threshold,
            # otherwise this retry can recurse onto the same provider forever.
            provider.mark_limited()
            # Retry with the next available provider
            return await self.call(prompt, requirements)
```

Provider pooling gives you three benefits: higher aggregate throughput, fault tolerance (one provider's outage doesn't take you down), and negotiating leverage (you're not locked into a single vendor).
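Calling into the pool might look like the hypothetical snippet below. The `requirements` keys are assumed -- the lesson leaves their shape up to whatever contract your `Provider.supports()` implements:

```python
pool = ProviderPool()

async def summarize(text: str) -> str:
    # The requirements dict is illustrative: any provider whose
    # supports() accepts it is eligible for this request.
    return await pool.call(
        prompt=f"Summarize in three bullet points:\n\n{text}",
        requirements={"min_context_tokens": 8000, "streaming": False},
    )
```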
Consistency warning: Different providers produce different outputs for the same prompt. If output consistency matters (e.g., brand voice), restrict pooling to same-family models or add a normalization layer that standardizes outputs across providers.
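One way to build that normalization layer is a second pass through a single "house" model that rewrites every provider's output into one voice. A hedged sketch -- the function name and prompt are assumptions, and `house_model` is any object with the same `call(prompt)` interface as the providers above:

```python
async def normalize_voice(raw_output: str, style_guide: str, house_model) -> str:
    # Hypothetical post-processing pass: one canonical model rewrites
    # each provider's output so downstream text sounds consistent.
    return await house_model.call(
        "Rewrite the text below to match the style guide. "
        "Preserve the meaning exactly.\n\n"
        f"Style guide:\n{style_guide}\n\n"
        f"Text:\n{raw_output}"
    )
```

This trades an extra model call per request for consistent tone; restricting the pool to same-family models gets cheaper consistency with no second pass.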