← Back to Blog

Local-First AI: Build Systems Without the Cloud

Run AI models, embeddings, and memory on your own hardware. Architecture guide for local-first AI with real production numbers.


Every AI product you use sends your data somewhere else. Your prompts, your documents, your code, your medical questions, your half-formed ideas — all routed through servers you do not control, logged in databases you cannot audit, retained under policies that change without notice.

Local-first AI is the exit. You run the models on your own hardware. Your data never leaves your machine. Your memory persists across sessions in a database you own. Your embeddings live in a vector store on your local disk. No API keys. No usage caps. No surprise bills. No terms of service that let a company train on your conversations.

We run this architecture in production at Like One. Over 1,000 brain entries. Over 7,700 vectors across 11 collections. Semantic search, graph relationships, and hybrid retrieval — all on a single Mac. This guide shows you exactly how to build it.

Why Local-First AI Matters Now

Three shifts converged in 2026 that make local-first AI practical for the first time:

Hardware caught up. Apple Silicon's unified memory architecture gives a laptop direct GPU access to 64GB or more of RAM. An M3 Max runs a 30-billion parameter model at interactive speed. An M4 runs it faster. You do not need a data center — your desk is the data center.

Models got efficient. Quantization techniques like GGUF and QLoRA compress models to 4-bit precision with minimal quality loss. A 14-billion parameter model that needed 28GB in full precision runs in 8GB quantized. Models that required a cluster two years ago now run on a laptop.

Privacy became non-negotiable. Enterprise AI policies tightened. Healthcare and legal cannot send data to third-party APIs. The EU AI Act introduced data residency requirements. And individuals increasingly realize that their AI conversations — the questions they ask, the problems they face, the ideas they explore — are among the most intimate data they generate.

Local-first does not mean cloud-never. It means your default is local, and cloud is an intentional choice for specific tasks. Apple Foundation Models runs on-device. Ollama runs local models. And when you need frontier reasoning, you call Claude — knowing exactly what data you are sending and why.

The Local-First AI Stack

A production local-first AI system has four layers: inference, embeddings, memory, and orchestration. Each layer has mature, battle-tested tooling. Here is what we run.

Layer 1: Local Inference

Ollama is the standard runtime for local models. Install it, pull a model, and start generating:

ollama pull qwen3:14b
ollama run qwen3:14b "Explain retrieval-augmented generation in three sentences."

For programmatic access, Ollama exposes an OpenAI-compatible API at localhost:11434:

import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3:14b",
    "prompt": "What is the Model Context Protocol?",
    "stream": False
})
print(response.json()["response"])

On Apple Silicon, you also get Apple Foundation Models — a 3-billion parameter model running on the Neural Engine. It is six times faster than Ollama for simple structured tasks because it uses dedicated hardware instead of GPU compute. We use both: Apple FM for fast structured output (quizzes, classification, summaries) and Ollama for complex reasoning and embeddings.

Model Selection Guide

TaskModelSizeSpeed
Chat and reasoningqwen3:14b8.5GB~25 tok/s (M3 Max)
Deep analysisdeepseek-r1:32b19GB~12 tok/s (M3 Max)
Embeddings (1024d)mxbai-embed-large670MB37ms/query (M3 Max)
Fast structured outputApple Foundation Models~3B on-device1.1s response
Lightweight tasksqwen3:4b2.6GB~60 tok/s (M3 Max)

Start with the smallest model that handles your task. You can always route complex queries to a larger model — but you cannot get back the latency you waste on an oversized default.

Layer 2: Local Embeddings

Embeddings convert text into vectors — numerical representations that capture semantic meaning. They power search, similarity, clustering, and retrieval-augmented generation. Most tutorials tell you to use OpenAI's embedding API. Do not.

import requests

def embed(text: str) -> list[float]:
    response = requests.post("http://localhost:11434/api/embed", json={
        "model": "mxbai-embed-large",
        "input": text
    })
    return response.json()["embeddings"][0]

The mxbai-embed-large model produces 1024-dimensional vectors at 37 milliseconds per query on an M3. That is fast enough for real-time search. It runs entirely on your machine. No API key. No rate limits. No cost per embedding.

For storage, skip the heavyweight vector databases. You do not need Pinecone or Weaviate or a managed Qdrant cluster. You need sqlite-vec — a SQLite extension that adds vector search to any SQLite database:

import sqlite3
import sqlite_vec

db = sqlite3.connect("brain.db")
db.enable_load_extension(True)
sqlite_vec.load(db)

# Create a vector table
db.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS vec_entries
    USING vec0(embedding float[1024])
""")

# Insert a vector
db.execute("INSERT INTO vec_entries(rowid, embedding) VALUES (?, ?)",
           [1, embed("persistent memory in AI systems")])

# Search by similarity
query_vec = embed("how does AI remember things")
results = db.execute("""
    SELECT rowid, distance
    FROM vec_entries
    WHERE embedding MATCH ?
    ORDER BY distance
    LIMIT 5
""", [query_vec]).fetchall()

SQLite is the most deployed database engine on earth. It runs everywhere. It needs no server. It backs up with a file copy. Adding vector search to it means your AI memory is just a file — portable, inspectable, and yours.

Layer 3: Persistent Memory

Inference generates text. Embeddings enable search. But memory is what makes an AI system useful over time. Without persistent memory, every conversation starts from zero. The AI does not know what you worked on yesterday, what decisions you made, or what it already told you.

A local-first memory system needs three components:

  • Structured storage. Key-value pairs for facts, preferences, decisions, and state. SQLite handles this directly — no ORM needed, no migration framework, just tables.
  • Semantic search. Vector embeddings for finding related memories by meaning, not just keywords. This is what sqlite-vec provides.
  • Graph relationships. Connections between memories — this decision led to that outcome, this concept relates to that project, this person works on that team. Graph edges turn isolated facts into connected knowledge.

We combine all three in a single SQLite database. Our production system holds over 1,000 brain entries with over 7,700 vectors and over 2,400 graph edges — semantic connections, mention links, and reference chains that let the AI traverse knowledge the way humans do. For a deep dive on the memory architecture, read our persistent memory guide.

The critical design decision: embed continuously, not on-demand. We run an embedding cron job every five minutes that picks up new entries, generates vectors, and indexes them. When the AI searches memory, the vectors are already there. Search stays fast because the expensive work happened in the background.

Layer 4: Orchestration

The orchestration layer routes tasks to the right model, manages tool execution, and maintains conversation context. In a cloud-first world, this is your API gateway. In a local-first world, it is a Python script on your machine.

The Model Context Protocol (MCP) is the standard for connecting AI models to tools and data sources. MCP servers expose capabilities — file access, database queries, API calls, system commands — and AI models consume them through a unified protocol. Build an MCP server once, and any MCP-compatible client can use it.

For agent orchestration, the pattern is a loop: the model reasons about the task, selects a tool, executes it, observes the result, and repeats until done. We cover this in detail in our agentic loops explainer and our Agent SDK tutorial.

Hybrid Architecture: Local Default, Cloud When Needed

Pure local-first is a philosophy, not a prison. Some tasks genuinely need frontier models. Writing a 3,000-word technical blog post. Reviewing a complex pull request. Analyzing a legal document. These tasks benefit from Claude's million-token context window and deep reasoning capabilities.

The key is intentional routing. Every query hits local first. Only when local cannot handle it — context too long, reasoning too complex, quality threshold not met — does the system escalate to cloud.

def route_query(query: str, context_tokens: int) -> str:
    # Fast structured tasks: Apple Foundation Models
    if is_structured_task(query):
        return "apple_fm"
    # Short context, standard reasoning: local Ollama
    if context_tokens < 8000:
        return "ollama_local"
    # Long context or complex reasoning: cloud
    return "claude_api"

In practice, we find that 80-90% of queries resolve locally. Embeddings are always local. Memory operations are always local. Simple generation, classification, and summarization are always local. The cloud handles the remaining 10-20% — the hard problems where a larger model genuinely produces better results.

This is not about ideology. It is about economics. Local inference costs electricity. Cloud inference costs tokens. When you run thousands of queries per day for embeddings, search, and background processing, the economics of local-first become overwhelming. Our embedding pipeline processes entries every five minutes. At cloud API pricing, that would cost hundreds of dollars per month. Locally, it costs the electricity to run an M3 — which we are running anyway.

Security and Backup

Local-first means you own your data. That also means you own your backup strategy.

SQLite databases are single files. Back them up with a file copy — but respect the write-ahead log (WAL). A naive file copy during a write can produce a corrupt backup. Use SQLite's backup API or checkpoint the WAL first:

import sqlite3
import shutil

def safe_backup(source: str, dest: str):
    db = sqlite3.connect(source)
    db.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    db.close()
    shutil.copy2(source, dest)

We run daily WAL-safe backups to a local directory with 14-day retention, plus encrypted offsite backups to iCloud Drive using AES-256 encryption. The encryption key lives in the macOS Keychain — not in a config file, not in an environment variable, not in a git repository.

For secrets management, use your operating system's keychain. macOS Keychain, Linux Secret Service, Windows Credential Manager. These are purpose-built, hardware-backed credential stores. There is no reason to store API keys in .env files when every major OS provides encrypted credential storage.

Performance Benchmarks

Real numbers from our production system running on an M3 Max with 64GB unified memory:

OperationTimeNotes
Embedding (mxbai-embed-large)37msSingle query, 1024 dimensions
Vector search (sqlite-vec)2-5msTop-5 from 7,700+ vectors
FTS5 text search0.3msBM25 ranking
Hybrid search (vec + FTS5)45msCombined semantic + keyword
Hybrid search + reranking~2.8sLLM reranker for high-stakes queries
Apple FM generation1.1sSimple text response
Apple FM structured output3.8s@Generable quiz question
Ollama qwen3:14b generation~6-10sParagraph-length response
Full brain boot (all context)<1sLoad session state + keys

These numbers are fast enough for interactive use. The hybrid search path — the one that powers our AI assistant's memory — returns results in 45 milliseconds. That is imperceptible. Your AI remembers everything, instantly, without touching the network.

Getting Started: Minimum Viable Local AI

You do not need our full stack to start. Here is the minimum viable local-first AI system:

  1. Install Ollama. One command: curl -fsSL https://ollama.com/install.sh | sh
  2. Pull two models. One for chat (ollama pull qwen3:4b), one for embeddings (ollama pull mxbai-embed-large).
  3. Create a SQLite database with sqlite-vec. pip install sqlite-vec. Create one table for text, one virtual table for vectors.
  4. Write an embed-and-store script. Takes text, generates an embedding, stores both in the database. Run it on your notes, your code comments, your meeting transcripts — whatever you want your AI to remember.
  5. Write a search-and-generate script. Takes a query, finds the most relevant stored entries by vector similarity, passes them as context to the chat model. This is RAG — running entirely on your machine.

That is a complete local-first AI system in five steps. No accounts. No API keys. No cloud services. Just your machine, your data, and models that run on your hardware.

From there, add layers as you need them: graph relationships for connected knowledge, FTS5 for keyword search, scheduled embedding for continuous indexing, MCP servers for tool integration. Each layer adds capability without adding external dependencies.

Common Mistakes

After building and running local-first AI systems for months, these are the mistakes we see most often:

  • Starting with the biggest model. A 70B model on 32GB RAM will swap to disk and crawl. Start with 4B or 7B. Upgrade only when quality is the bottleneck, not before.
  • Using a vector database instead of sqlite-vec. Unless you have millions of vectors and need distributed search, a vector database adds operational complexity with no benefit. SQLite handles tens of thousands of vectors at sub-5ms latency.
  • Embedding on-demand instead of in the background. If you embed at query time, your search has a 37ms tax on every query plus the embedding of the document itself. Embed continuously in the background so search stays fast.
  • Skipping the hybrid search path. Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine FTS5 and sqlite-vec for results that catch both.

The Future Is Local

The trajectory is clear. Models are getting smaller and faster. Hardware is getting more capable. Privacy regulations are getting stricter. The era of sending every AI query to a remote server is ending — not because the cloud is bad, but because the local alternative is now good enough for most tasks and strictly better for privacy, latency, and cost.

Build local-first. Use cloud when you need it. Own your data, your models, and your memory. That is not a philosophical stance — it is an engineering decision backed by the numbers.

For hands-on tutorials on the tools mentioned in this guide, explore our Academy — 53 courses covering AI architecture, MCP, RAG, persistent memory, and more. First three lessons of every course are free.

Need Help Building Your Local AI Stack?

We architect and build local-first AI systems — from SQLite memory to MCP integration to hybrid cloud routing. Talk to us.


Keep learning — for free

50+ AI courses. 590+ lessons. No paywall for starters.

Need help building this?

We build MCP servers, Claude workflows, and AI agents for teams. Strategy calls start at $150/hr.