
Hybrid Local + Cloud Systems

Get the best of both worlds -- local privacy for sensitive work, cloud power for tasks that need frontier intelligence.

After this lesson you'll know

  • How to design a hybrid routing system that sends each task to the right backend
  • How to build a unified API gateway that abstracts local and cloud models
  • How to cut costs by defaulting to local models and escalating only when needed
  • How to add failover so requests stay reliable across local and cloud

The Case for Hybrid

Pure local has limitations: smaller context windows, weaker reasoning on complex tasks, no vision capabilities on most models. Pure cloud has risks: privacy exposure, ongoing costs, vendor dependency. The answer for most serious users is hybrid -- and the key is building a system that routes intelligently.

The hybrid principle: default local, escalate to cloud. Every request starts at the local model. Only when the task demonstrably requires frontier capability (or the user explicitly requests it) does the request go to a cloud API. This minimizes cost, maximizes privacy, and still gives you access to the best models when you need them.
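
To make "default local, escalate to cloud" concrete, here is a minimal tier-selection sketch. The keyword list, the 8,000-token local limit, and the rough four-characters-per-token estimate are all illustrative assumptions, not the lesson's reference logic:

Tier Selection Sketch

ESCALATION_HINTS = ("prove", "step-by-step plan", "legal analysis")

def pick_tier(messages, local_context_tokens=8000):
    text = " ".join(m["content"] for m in messages)
    # Escalate when the prompt likely exceeds the local context window
    # (rough estimate: ~4 characters per token).
    if len(text) / 4 > local_context_tokens:
        return "cloud"
    # Escalate when the user signals a frontier-level task.
    if any(hint in text.lower() for hint in ESCALATION_HINTS):
        return "cloud"
    return "local"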

Building a Unified API Gateway

A unified API gateway gives your applications one endpoint that routes to the right backend. Every tool in your stack talks to the gateway; the gateway decides whether to use Ollama or a cloud API:

Simple Python Gateway

import os
import requests

ROUTES = {
    "local": "http://localhost:11434/v1/chat/completions",
    "cloud": "https://api.anthropic.com/v1/messages",
}

def route_request(messages, tier="local", model=None):
    if tier == "local":
        # Ollama's OpenAI-compatible chat endpoint
        model = model or "qwen2.5:14b"
        r = requests.post(ROUTES["local"], json={
            "model": model,
            "messages": messages
        })
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

    elif tier == "cloud":
        # Anthropic Messages API
        model = model or "claude-sonnet-4-20250514"
        headers = {
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "content-type": "application/json",
            "anthropic-version": "2023-06-01"
        }
        r = requests.post(ROUTES["cloud"], headers=headers,
            json={
                "model": model, "max_tokens": 4096,
                "messages": messages
            })
        r.raise_for_status()
        return r.json()["content"][0]["text"]

# Usage: default to the local tier
messages = [{"role": "user", "content": "Summarize these notes."}]
answer = route_request(messages, tier="local")

# Escalate to cloud for complex reasoning
answer = route_request(messages, tier="cloud")

Ollama's OpenAI-compatible API (/v1/chat/completions) means most tools built for OpenAI's API work with local models after nothing more than a base-URL change. Your gateway exploits this compatibility.
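
For example, a tool written against the official openai Python package can be pointed at Ollama directly. A minimal sketch, assuming the openai package is installed; the api_key value is a placeholder, since Ollama ignores it:

Pointing an OpenAI Client at Ollama

from openai import OpenAI

# Same client code a cloud tool would use -- only the base URL differs.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5:14b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)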
