What is local-first AI?

Local-first AI means running AI models, embeddings, and memory on your own hardware by default. Your data stays on your machine. Cloud APIs are used intentionally for specific tasks, not as the default for everything. This gives you better privacy, lower latency, no usage costs, and full control over your data.

What hardware do I need to run AI models locally?

Any Apple Silicon Mac (M1 or later) can run local AI models. For comfortable use with 14B parameter models, 16GB RAM is the minimum and 32GB is recommended. An M3 Max with 64GB can run 30B+ parameter models at interactive speed. On Linux or Windows, an NVIDIA GPU with 8GB+ VRAM handles most quantized models.

Is local AI as good as cloud AI like ChatGPT or Claude?

For many tasks, yes. Local 14B models handle chat, summarization, classification, and code completion well. For complex multi-step reasoning, long-context analysis, or frontier-quality content generation, cloud models like Claude still outperform. The best approach is hybrid: local by default, cloud when the task demands it.

How do I store embeddings locally without Pinecone?

Use sqlite-vec, a SQLite extension that adds vector search to any SQLite database. Install it with pip install sqlite-vec, create a virtual table with your vector dimensions, and query with nearest-neighbor search. It handles thousands of vectors with sub-5ms search times and requires no server or cloud account.

What is the best local embedding model?

mxbai-embed-large is an excellent choice for local embeddings. It produces 1024-dimensional vectors, runs at 37ms per query on Apple Silicon, and is available through Ollama with a single pull command. It handles English text well for search, clustering, and RAG applications.

How much does it cost to run AI locally?

The hardware cost is whatever Mac or GPU machine you already own. After that, the ongoing cost is electricity only. There are no per-token charges, no monthly subscriptions, no usage caps. For workloads that process thousands of queries daily — like embedding pipelines or memory search — local inference saves hundreds of dollars per month compared to cloud API pricing.

Can I build RAG entirely on local hardware?

Yes. Use Ollama for embeddings (mxbai-embed-large), sqlite-vec for vector storage and search, and any local chat model for generation. The entire retrieval-augmented generation pipeline runs on your machine. Our production system searches over 7,700 vectors in 45ms with hybrid semantic and keyword search, all locally.

What is the difference between Ollama and Apple Foundation Models?

Ollama runs open-source models (Llama, Qwen, Mistral, DeepSeek) on GPU compute. Apple Foundation Models runs Apple's 3B parameter model on the dedicated Neural Engine. Apple FM is faster for simple structured tasks (1.1s vs 6-10s) but limited to Apple devices and shorter context. Ollama supports larger models, longer context, and runs on any platform.

How do I back up a local AI brain safely?

SQLite databases using WAL mode need a checkpoint before copying to avoid corruption. Run PRAGMA wal_checkpoint(TRUNCATE) before the file copy. Automate daily backups with retention policies. For offsite backup, encrypt with AES-256 before uploading to any cloud storage. Store encryption keys in your OS keychain, never in config files.

What is the Model Context Protocol and why does it matter for local AI?

MCP is a standard protocol for connecting AI models to tools and data sources. It matters for local AI because it lets your local models access files, databases, APIs, and system commands through a unified interface. Build an MCP server once and any MCP-compatible client can use it — whether that client runs locally or in the cloud.

Local-First AI: Build Systems Without the Cloud

Run AI models, embeddings, and memory on your own hardware. Architecture guide for local-first AI with real production numbers.

Every AI product you use sends your data somewhere else. Your prompts, your documents, your code, your medical questions, your half-formed ideas — all routed through servers you do not control, logged in databases you cannot audit, retained under policies that change without notice.

Local-first AI is the exit. You run the models on your own hardware. Your data never leaves your machine. Your memory persists across sessions in a database you own. Your embeddings live in a vector store on your local disk. No API keys. No usage caps. No surprise bills. No terms of service that let a company train on your conversations.

We run this architecture in production at Like One. Over 1,000 brain entries. Over 7,700 vectors across 11 collections. Semantic search, graph relationships, and hybrid retrieval — all on a single Mac. This guide shows you exactly how to build it.

Why Local-First AI Matters Now

Three shifts converged in 2026 that make local-first AI practical for the first time:

Hardware caught up. Apple Silicon's unified memory architecture gives a laptop direct GPU access to 64GB or more of RAM. An M3 Max runs a 30-billion parameter model at interactive speed. An M4 runs it faster. You do not need a data center — your desk is the data center.

Models got efficient. Quantization techniques like GGUF and QLoRA compress models to 4-bit precision with minimal quality loss. A 14-billion parameter model that needed 28GB in full precision runs in 8GB quantized. Models that required a cluster two years ago now run on a laptop.

Privacy became non-negotiable. Enterprise AI policies tightened. Healthcare and legal cannot send data to third-party APIs. The EU AI Act introduced data residency requirements. And individuals increasingly realize that their AI conversations — the questions they ask, the problems they face, the ideas they explore — are among the most intimate data they generate.

Local-first does not mean cloud-never. It means your default is local, and cloud is an intentional choice for specific tasks. Apple Foundation Models runs on-device. Ollama runs local models. And when you need frontier reasoning, you call Claude — knowing exactly what data you are sending and why.

The Local-First AI Stack

A production local-first AI system has four layers: inference, embeddings, memory, and orchestration. Each layer has mature, battle-tested tooling. Here is what we run.

Layer 1: Local Inference

Ollama is the standard runtime for local models. Install it, pull a model, and start generating:

ollama pull qwen3:14b
ollama run qwen3:14b "Explain retrieval-augmented generation in three sentences."

For programmatic access, Ollama exposes an OpenAI-compatible API at localhost:11434:

import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3:14b",
    "prompt": "What is the Model Context Protocol?",
    "stream": False
})
print(response.json()["response"])

On Apple Silicon, you also get Apple Foundation Models — a 3-billion parameter model running on the Neural Engine. It is six times faster than Ollama for simple structured tasks because it uses dedicated hardware instead of GPU compute. We use both: Apple FM for fast structured output (quizzes, classification, summaries) and Ollama for complex reasoning and embeddings.

Model Selection Guide

Task	Model	Size	Speed
Chat and reasoning	qwen3:14b	8.5GB	~25 tok/s (M3 Max)
Deep analysis	deepseek-r1:32b	19GB	~12 tok/s (M3 Max)
Embeddings (1024d)	mxbai-embed-large	670MB	37ms/query (M3 Max)
Fast structured output	Apple Foundation Models	~3B on-device	1.1s response
Lightweight tasks	qwen3:4b	2.6GB	~60 tok/s (M3 Max)

Start with the smallest model that handles your task. You can always route complex queries to a larger model — but you cannot get back the latency you waste on an oversized default.

Layer 2: Local Embeddings

Embeddings convert text into vectors — numerical representations that capture semantic meaning. They power search, similarity, clustering, and retrieval-augmented generation. Most tutorials tell you to use OpenAI's embedding API. Do not.

import requests

def embed(text: str) -> list[float]:
    response = requests.post("http://localhost:11434/api/embed", json={
        "model": "mxbai-embed-large",
        "input": text
    })
    return response.json()["embeddings"][0]

The mxbai-embed-large model produces 1024-dimensional vectors at 37 milliseconds per query on an M3. That is fast enough for real-time search. It runs entirely on your machine. No API key. No rate limits. No cost per embedding.

For storage, skip the heavyweight vector databases. You do not need Pinecone or Weaviate or a managed Qdrant cluster. You need sqlite-vec — a SQLite extension that adds vector search to any SQLite database:

import sqlite3
import sqlite_vec

db = sqlite3.connect("brain.db")
db.enable_load_extension(True)
sqlite_vec.load(db)

# Create a vector table
db.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS vec_entries
    USING vec0(embedding float[1024])
""")

# Insert a vector
db.execute("INSERT INTO vec_entries(rowid, embedding) VALUES (?, ?)",
           [1, embed("persistent memory in AI systems")])

# Search by similarity
query_vec = embed("how does AI remember things")
results = db.execute("""
    SELECT rowid, distance
    FROM vec_entries
    WHERE embedding MATCH ?
    ORDER BY distance
    LIMIT 5
""", [query_vec]).fetchall()

SQLite is the most deployed database engine on earth. It runs everywhere. It needs no server. It backs up with a file copy. Adding vector search to it means your AI memory is just a file — portable, inspectable, and yours.

Layer 3: Persistent Memory

Inference generates text. Embeddings enable search. But memory is what makes an AI system useful over time. Without persistent memory, every conversation starts from zero. The AI does not know what you worked on yesterday, what decisions you made, or what it already told you.

A local-first memory system needs three components:

Structured storage. Key-value pairs for facts, preferences, decisions, and state. SQLite handles this directly — no ORM needed, no migration framework, just tables.
Semantic search. Vector embeddings for finding related memories by meaning, not just keywords. This is what sqlite-vec provides.
Graph relationships. Connections between memories — this decision led to that outcome, this concept relates to that project, this person works on that team. Graph edges turn isolated facts into connected knowledge.

We combine all three in a single SQLite database. Our production system — built on sovereign-brain, an open-source Python package — holds over 1,000 brain entries with over 7,700 vectors and over 2,400 graph edges — semantic connections, mention links, and reference chains that let the AI traverse knowledge the way humans do. For a deep dive on the memory architecture, read our persistent memory guide.

The critical design decision: embed continuously, not on-demand. We run an embedding cron job every five minutes that picks up new entries, generates vectors, and indexes them. When the AI searches memory, the vectors are already there. Search stays fast because the expensive work happened in the background.

Layer 4: Orchestration

The orchestration layer routes tasks to the right model, manages tool execution, and maintains conversation context. In a cloud-first world, this is your API gateway. In a local-first world, it is a Python script on your machine.

The Model Context Protocol (MCP) is the standard for connecting AI models to tools and data sources. MCP servers expose capabilities — file access, database queries, API calls, system commands — and AI models consume them through a unified protocol. Build an MCP server once, and any MCP-compatible client can use it.

For agent orchestration, the pattern is a loop: the model reasons about the task, selects a tool, executes it, observes the result, and repeats until done. We cover this in detail in our agentic loops explainer and our Agent SDK tutorial.

Hybrid Architecture: Local Default, Cloud When Needed

Pure local-first is a philosophy, not a prison. Some tasks genuinely need frontier models. Writing a 3,000-word technical blog post. Reviewing a complex pull request. Analyzing a legal document. These tasks benefit from Claude's million-token context window and deep reasoning capabilities.

The key is intentional routing. Every query hits local first. Only when local cannot handle it — context too long, reasoning too complex, quality threshold not met — does the system escalate to cloud.

def route_query(query: str, context_tokens: int) -> str:
    # Fast structured tasks: Apple Foundation Models
    if is_structured_task(query):
        return "apple_fm"
    # Short context, standard reasoning: local Ollama
    if context_tokens < 8000:
        return "ollama_local"
    # Long context or complex reasoning: cloud
    return "claude_api"

In practice, we find that 80-90% of queries resolve locally. Embeddings are always local. Memory operations are always local. Simple generation, classification, and summarization are always local. The cloud handles the remaining 10-20% — the hard problems where a larger model genuinely produces better results.

This is not about ideology. It is about economics. Local inference costs electricity. Cloud inference costs tokens. When you run thousands of queries per day for embeddings, search, and background processing, the economics of local-first become overwhelming. Our embedding pipeline processes entries every five minutes. At cloud API pricing, that would cost hundreds of dollars per month. Locally, it costs the electricity to run an M3 — which we are running anyway.

Security and Backup

Local-first means you own your data. That also means you own your backup strategy.

SQLite databases are single files. Back them up with a file copy — but respect the write-ahead log (WAL). A naive file copy during a write can produce a corrupt backup. Use SQLite's backup API or checkpoint the WAL first:

import sqlite3
import shutil

def safe_backup(source: str, dest: str):
    db = sqlite3.connect(source)
    db.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    db.close()
    shutil.copy2(source, dest)

We run daily WAL-safe backups to a local directory with 14-day retention, plus encrypted offsite backups to iCloud Drive using AES-256 encryption. The encryption key lives in the macOS Keychain — not in a config file, not in an environment variable, not in a git repository.

For secrets management, use your operating system's keychain. macOS Keychain, Linux Secret Service, Windows Credential Manager. These are purpose-built, hardware-backed credential stores. There is no reason to store API keys in .env files when every major OS provides encrypted credential storage.

Performance Benchmarks

Real numbers from our production system running on an M3 Max with 64GB unified memory:

Operation	Time	Notes
Embedding (mxbai-embed-large)	37ms	Single query, 1024 dimensions
Vector search (sqlite-vec)	2-5ms	Top-5 from 7,700+ vectors
FTS5 text search	0.3ms	BM25 ranking
Hybrid search (vec + FTS5)	45ms	Combined semantic + keyword
Hybrid search + reranking	~2.8s	LLM reranker for high-stakes queries
Apple FM generation	1.1s	Simple text response
Apple FM structured output	3.8s	@Generable quiz question
Ollama qwen3:14b generation	~6-10s	Paragraph-length response
Full brain boot (all context)	<1s	Load session state + keys

These numbers are fast enough for interactive use. The hybrid search path — the one that powers our AI assistant's memory — returns results in 45 milliseconds. That is imperceptible. Your AI remembers everything, instantly, without touching the network.

Getting Started: Minimum Viable Local AI

You do not need our full stack to start. Here is the minimum viable local-first AI system:

Install Ollama. One command: curl -fsSL https://ollama.com/install.sh | sh
Pull two models. One for chat (ollama pull qwen3:4b), one for embeddings (ollama pull mxbai-embed-large).
Create a SQLite database with sqlite-vec. pip install sqlite-vec. Create one table for text, one virtual table for vectors.
Write an embed-and-store script. Takes text, generates an embedding, stores both in the database. Run it on your notes, your code comments, your meeting transcripts — whatever you want your AI to remember.
Write a search-and-generate script. Takes a query, finds the most relevant stored entries by vector similarity, passes them as context to the chat model. This is RAG — running entirely on your machine.

That is a complete local-first AI system in five steps. No accounts. No API keys. No cloud services. Just your machine, your data, and models that run on your hardware.

From there, add layers as you need them: graph relationships for connected knowledge, FTS5 for keyword search, scheduled embedding for continuous indexing, MCP servers for tool integration. Each layer adds capability without adding external dependencies.

Common Mistakes

After building and running local-first AI systems for months, these are the mistakes we see most often:

Starting with the biggest model. A 70B model on 32GB RAM will swap to disk and crawl. Start with 4B or 7B. Upgrade only when quality is the bottleneck, not before.
Using a vector database instead of sqlite-vec. Unless you have millions of vectors and need distributed search, a vector database adds operational complexity with no benefit. SQLite handles tens of thousands of vectors at sub-5ms latency.
Embedding on-demand instead of in the background. If you embed at query time, your search has a 37ms tax on every query plus the embedding of the document itself. Embed continuously in the background so search stays fast.
Skipping the hybrid search path. Pure vector search misses exact keyword matches. Pure keyword search misses semantic similarity. Combine FTS5 and sqlite-vec for results that catch both.

The Future Is Local

The trajectory is clear. Models are getting smaller and faster. Hardware is getting more capable. Privacy regulations are getting stricter. The era of sending every AI query to a remote server is ending — not because the cloud is bad, but because the local alternative is now good enough for most tasks and strictly better for privacy, latency, and cost.

Build local-first. Use cloud when you need it. Own your data, your models, and your memory. That is not a philosophical stance — it is an engineering decision backed by the numbers.

For hands-on tutorials on the tools mentioned in this guide, explore our Academy — 53 courses covering AI architecture, MCP, RAG, persistent memory, and more. First three lessons of every course are free.

Need Help Building Your Local AI Stack?

We architect and build local-first AI systems — from SQLite memory to MCP integration to hybrid cloud routing. Talk to us.