Cost Optimization
Cut your AI bill by up to 90%: local models for routine tasks, cloud only for complex work.
Most businesses overspend on AI by 10x because they send every request to expensive cloud APIs. The sovereign stack runs routine tasks locally for free and routes only the hardest work to the cloud. This lesson shows you exactly how to build that routing.
What you'll learn
- The task routing framework: which tasks go local vs. cloud
- Measuring cost per task and optimizing the split
- Token reduction techniques that cut costs without cutting quality
- Building a cost dashboard that tracks spending in real time
The Cost Routing Decision
Every AI request has a complexity level. Simple tasks (classification, summarization, template filling) run perfectly on local models. Complex tasks (long-form analysis, nuanced writing, multi-step reasoning) benefit from cloud frontier models. The key is routing correctly:
Local (free): Email triage, content classification, data extraction, template filling, simple Q&A, code formatting, JSON generation, sentiment analysis. These are pattern-matching tasks that 7B models handle well.
Cloud (paid): Long-form blog posts, complex code generation, multi-step reasoning chains, nuanced client communication, strategic planning, creative writing. These need frontier model capabilities.
Hybrid: Draft locally, polish on cloud. Local model generates a rough draft (free), cloud model refines it into final quality (one API call instead of iterating). This cuts cloud usage by 60-80% for content tasks.
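The hybrid pattern can be sketched as a small pipeline. This is a minimal sketch with the model calls injected as functions; draftLocally and polishOnCloud are hypothetical callers you would wire to your Ollama and cloud API clients.

```javascript
// Hybrid pipeline: draft locally for free, then make a single
// cloud call to polish the draft into final quality.
// draftLocally and polishOnCloud are injected so the pipeline
// stays testable; wire them to your Ollama / cloud API clients.
async function hybridGenerate(prompt, { draftLocally, polishOnCloud }) {
  // Step 1: free rough draft from the local model
  const draft = await draftLocally(prompt);

  // Step 2: one paid cloud call refines the draft -- instead of
  // iterating on the cloud model, it receives a near-finished text
  const final = await polishOnCloud(
    `Polish this draft into publishable quality:\n\n${draft}`
  );

  return { draft, final, cloudCalls: 1 };
}
```

The injection also makes it easy to swap models per task type without touching the pipeline itself.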
The Smart Router
// Smart model router -- chooses local vs. cloud per request
function routeRequest(task) {
  // Tasks that always go local (free)
  const localTasks = [
    'classify', 'triage', 'summarize_short', 'extract_data',
    'format_json', 'template_fill', 'sentiment', 'translate_simple'
  ];

  // Tasks that always go cloud (quality matters)
  const cloudTasks = [
    'blog_post', 'client_email_complex', 'strategic_plan',
    'code_architecture', 'legal_review', 'creative_writing'
  ];

  if (localTasks.includes(task.type)) {
    return { model: 'ollama/qwen2.5:7b', cost: 0, reason: 'routine task' };
  }

  if (cloudTasks.includes(task.type)) {
    return {
      model: 'claude-sonnet-4-20250514',
      cost: estimateCost(task),
      reason: 'requires frontier quality'
    };
  }

  // Default: try local first, escalate to cloud if quality is poor
  return {
    model: 'ollama/qwen2.5:7b',
    fallback: 'claude-sonnet-4-20250514',
    cost: 0,
    reason: 'try local, escalate if needed'
  };
}

// Cost estimation based on token count (~4 characters per token)
function estimateCost(task) {
  const inputTokens = Math.ceil(task.prompt.length / 4);
  const outputTokens = task.maxTokens || 1000;
  // Claude Sonnet pricing (approximate): $3 / M input, $15 / M output
  return ((inputTokens * 0.003) + (outputTokens * 0.015)) / 1000;
}

Token Reduction Techniques
Even when you must use cloud APIs, you can reduce how many tokens each request consumes:
Prompt compression. Rewrite system prompts to be concise. A 2000-token system prompt that could be 500 tokens wastes 1500 tokens on every request. Over 100 requests/day, that is 150,000 wasted tokens.
Context windowing. Do not send the entire conversation history with every request. Send the last 5-10 messages plus a summary of earlier context. This keeps input tokens manageable as conversations grow.
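Context windowing can be a pure function over the message history. A minimal sketch; the summarize callback (which in practice would itself be a cheap local-model call) is an assumption, with a plain-text placeholder as the fallback.

```javascript
// Keep the last N messages verbatim and collapse everything
// earlier into a single summary message at the head.
function windowContext(messages, keepLast = 8, summarize = null) {
  if (messages.length <= keepLast) return messages;

  const older = messages.slice(0, messages.length - keepLast);
  const recent = messages.slice(-keepLast);

  // summarize would typically be a cheap local-model call;
  // fall back to a simple placeholder if none is provided
  const summaryText = summarize
    ? summarize(older)
    : `[Summary of ${older.length} earlier messages]`;

  return [{ role: 'system', content: summaryText }, ...recent];
}
```

Input tokens now grow with the window size rather than the conversation length.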
Output capping. Set max_tokens to the minimum the task needs: a classification needs 50 tokens, not 4,000; a summary needs 200, not 2,000. Over-allocating output tokens does not directly cost more (you pay only for tokens actually generated), but a tight cap keeps responses concise and bounds your worst-case spend per request.
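One way to enforce output capping is a per-task-type lookup that your API caller consults before every request. The cap values below are illustrative defaults, not benchmarks.

```javascript
// Per-task output caps -- allocate only what the task needs.
// These numbers are illustrative starting points; tune per workload.
const OUTPUT_CAPS = {
  classify: 50,          // a label, maybe a confidence score
  sentiment: 20,         // one word is enough
  summarize_short: 200,
  extract_data: 500,
  blog_post: 2000,
};

// Look up the cap for a task type, with a conservative fallback
function maxTokensFor(taskType, fallback = 1000) {
  return OUTPUT_CAPS[taskType] ?? fallback;
}
```

You would pass the result as the request's max_tokens parameter instead of a fixed global value.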
Caching responses. If the same question is asked repeatedly (FAQ, standard classification), cache the response. Subsequent requests return the cached answer at zero cost. Invalidate the cache when the underlying data changes.
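A response cache can be as simple as a Map keyed on model plus prompt. This is a minimal in-memory sketch; the callModel function is injected and stands in for your real API client, and production use would add size limits or TTLs.

```javascript
// Response cache keyed on model + prompt. cachedCall only hits
// the (paid) model function on a cache miss.
const responseCache = new Map();

async function cachedCall(model, prompt, callModel) {
  const key = `${model}::${prompt}`;
  if (responseCache.has(key)) {
    return { response: responseCache.get(key), cached: true };
  }
  const response = await callModel(model, prompt);
  responseCache.set(key, response);
  return { response, cached: false };
}

// Invalidate when the underlying data changes
function invalidateCache() {
  responseCache.clear();
}
```

Repeated FAQ or classification requests then cost one API call total instead of one per request.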