Deploying Fine-Tuned Models
From trained weights to production endpoint with latency, cost, and reliability guarantees.
After this lesson you'll know
- Deployment options: self-hosted, serverless, and managed inference
- Quantization for inference: GPTQ, AWQ, and GGUF formats
- Serving frameworks: vLLM, TGI, and Ollama
- Production hardening: load balancing, monitoring, and autoscaling
Deployment Architecture Options
Three approaches, each with a different tradeoff profile:

**Option 1 - Self-hosted GPU server:**
```
How:      Rent a GPU instance, run a serving framework (vLLM, TGI)
Cost:     $0.50-4.00/hour (fixed, regardless of traffic)
Latency:  Lowest (no cold starts, dedicated hardware)
Control:  Maximum (custom batching, caching, routing)
Scaling:  Manual or with container orchestration (K8s)
Best for: Steady traffic, latency-sensitive apps, data privacy

Provider options:
  RunPod Serverless or Dedicated
  Lambda Labs
  AWS/GCP/Azure GPU instances
```

**Option 2 - Serverless inference:**
```
How:      Upload model to a serverless platform, pay per token
Cost:     $0.001-0.01 per 1K tokens (scales to zero)
Latency:  Higher (cold starts of 5-30 seconds)
Control:  Limited (platform manages infrastructure)
Scaling:  Automatic (handles traffic spikes)
Best for: Variable traffic, prototyping, cost optimization

Providers:
  Modal (Python-native, fast cold starts)
  Replicate (simple API, model marketplace)
  Together AI (competitive pricing)
  Fireworks AI (fast inference)
```

**Option 3 - Managed fine-tuning + hosting:**
```
How:      Fine-tune via API, model is automatically hosted
Cost:     Standard API pricing (often higher per token)
Latency:  Good (provider optimizes serving)
Control:  Minimal (black box)
Scaling:  Automatic
Best for: Teams without ML ops expertise

Providers:
  OpenAI Fine-Tuning API
  Anthropic Fine-Tuning (via partners)
  Google Vertex AI
```
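Start with Option 2 (serverless) for initial deployment. Move to Option 1 (self-hosted) when you need consistent latency or when your per-token spend exceeds the hourly price of a dedicated GPU. The crossover point is typically 50,000-100,000 tokens per hour sustained. For example, at $0.01 per 1K tokens, a $0.50/hour GPU breaks even at 50,000 tokens per hour. The crossover moves higher with a more expensive GPU or a cheaper per-token rate.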
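The crossover arithmetic can be sketched in a few lines of Python. This is a rough illustration, not a pricing tool: the function names `break_even_tokens_per_hour` and `cheaper_option` are hypothetical, the prices are taken from the ranges listed above, and the model assumes the GPU can actually serve the given volume and ignores cold starts, idle time, and operations overhead.

```python
def break_even_tokens_per_hour(gpu_cost_per_hour: float,
                               price_per_1k_tokens: float) -> float:
    """Sustained tokens/hour at which serverless spend equals the GPU's fixed hourly cost."""
    if gpu_cost_per_hour < 0 or price_per_1k_tokens <= 0:
        raise ValueError("GPU cost must be >= 0 and per-token price must be > 0")
    # Serverless cost per hour = (tokens / 1000) * price_per_1k_tokens.
    # Setting that equal to gpu_cost_per_hour and solving for tokens:
    return gpu_cost_per_hour / price_per_1k_tokens * 1000


def cheaper_option(tokens_per_hour: float,
                   gpu_cost_per_hour: float,
                   price_per_1k_tokens: float) -> str:
    """Return which option costs less at a sustained traffic level (cost only)."""
    serverless_cost = tokens_per_hour / 1000 * price_per_1k_tokens
    return "self-hosted" if serverless_cost > gpu_cost_per_hour else "serverless"


# Illustrative price pairs drawn from the ranges in this lesson (assumptions).
for gpu, per_1k in [(0.50, 0.01), (0.50, 0.005), (4.00, 0.001)]:
    tokens = break_even_tokens_per_hour(gpu, per_1k)
    print(f"${gpu:.2f}/hour GPU vs ${per_1k}/1K tokens: "
          f"break-even near {tokens:,.0f} tokens/hour")
```

The first two price pairs land on the 50,000 and 100,000 tokens-per-hour figures quoted above. The third shows how far the crossover shifts when the GPU is expensive and the per-token rate is low, so it is worth rerunning the calculation with your own provider's prices.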