Local AI: Ollama & Open Models
Your own AI, running on your own hardware, with zero API costs.
Ollama turns your laptop into an AI server. Open-weight models like Qwen, DeepSeek, and Llama run locally with no internet connection, no API keys, and no per-token charges. This lesson gets you from installation to production-quality local AI.
What you'll learn
- Setting up Ollama: installation, model pulling, and first inference
- Model selection: which models for which tasks
- Quantization: trading precision for speed and memory savings
- Performance tuning: getting the most from your hardware
What Is Ollama?
Ollama is a tool that runs large language models on your local machine. Think of it as Docker for AI models -- it downloads, manages, and serves models through a simple API that any application can call.
You install it once, pull the models you want, and start making AI requests -- all without an internet connection after the initial download. The models run entirely on your CPU or GPU, your data never leaves your machine, and there are no per-request costs.
Ollama exposes a REST API on localhost:11434, including endpoints compatible with the OpenAI API format. This means any tool built against the OpenAI API can be pointed at Ollama by changing its base URL. Your existing integrations just work.
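To see the compatibility in action, here is a minimal request to the OpenAI-style chat endpoint. This is a sketch: it assumes the server is already running and qwen2.5:7b has been pulled (setup is covered in the next section).

# Query Ollama through its OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'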
Getting Started
Setup takes less than 5 minutes:
# Install Ollama (macOS)
brew install ollama
# Or download from ollama.com for any platform
# Linux: curl -fsSL https://ollama.com/install.sh | sh
# Start the Ollama server
ollama serve
# Pull your first model (in a new terminal)
ollama pull qwen2.5:7b # 4.4 GB, excellent for general tasks
ollama pull deepseek-coder-v2 # Specialized for code
ollama pull llama3.1:8b # Meta's versatile model
# Test it immediately
ollama run qwen2.5:7b "Explain what sovereignty means for AI in 3 sentences."

That is it. You now have a local AI that responds to natural language, writes code, summarizes documents, and answers questions -- all without an internet connection or API key.
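Applications can talk to the same server over plain HTTP. A minimal sketch against Ollama's native generate endpoint, with streaming disabled so the full answer comes back as one JSON response:

# One-off completion via the native REST API (non-streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Summarize the benefits of local AI in one sentence.",
  "stream": false
}'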
Model Selection Guide
Not all models are created equal. Here is how to choose the right model for each task:
General assistant (Qwen 2.5 7B): The best all-around small model. Excellent at following instructions, summarizing, drafting, and Q&A. Runs well on 8GB RAM. Your default model for 80% of tasks.
Code generation (DeepSeek Coder V2): Specialized for writing and debugging code. Understands dozens of programming languages. Better at code than general models twice its size. Use this for development tasks.
Complex reasoning (Llama 3.1 70B): Meta's largest open model that is practical to run locally (a 405B variant exists, but it is out of reach for consumer hardware). Approaches cloud model quality for analysis, planning, and nuanced writing. Requires 40GB+ RAM. Use when you need frontier-quality reasoning without the cloud.
Embeddings (nomic-embed-text): Converts text into vector embeddings for search and retrieval. Fast, small, and purpose-built. Essential for building your local RAG pipeline.
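Embeddings use their own endpoint rather than the chat interface. A minimal sketch against the native embeddings API (the response contains a JSON array of floats you can store in a vector index):

# Pull the embedding model, then request a vector for a piece of text
ollama pull nomic-embed-text
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Sovereignty means owning your own compute."
}'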
# Model sizes and RAM requirements
# Model Size RAM Best For
# qwen2.5:3b 2 GB 4 GB Quick tasks, low-end hardware
# qwen2.5:7b 4 GB 8 GB General assistant (recommended start)
# llama3.1:8b 5 GB 8 GB Versatile, strong reasoning
# deepseek-coder 4 GB 8 GB Code generation
# qwen2.5:14b 9 GB 16 GB Better quality, more RAM
# llama3.1:70b 40 GB 48 GB Near-frontier quality
# nomic-embed 274 MB 1 GB Embeddings only

Quantization: The Quality-Speed Tradeoff
AI models are stored as numbers (weights). Full-precision weights use 16 bits per number. Quantization reduces this to 8 bits, 4 bits, or even 2 bits -- making the model smaller, faster, and able to run on less RAM. For a 7B-parameter model, that is the difference between roughly 14 GB at 16-bit and about 4 GB at 4-bit.
Think of it like image compression. A full-quality photo is 10MB. A compressed version is 2MB. You lose some detail, but for most purposes it looks the same. Quantization works the same way for AI models.
Q4_K_M (4-bit, recommended): The sweet spot. Models are roughly 4x smaller than full precision. Quality loss is minimal for most tasks. This is what Ollama uses by default.
Q8_0 (8-bit): Higher quality, but models are 2x larger. Use when you have enough RAM and quality matters (long-form writing, complex analysis).
Q2_K (2-bit): Maximum compression. Models are tiny but quality degrades noticeably. Use only when RAM is severely constrained and you need something running.
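You can also choose a quantization explicitly when pulling, instead of taking the default. A sketch -- the exact tags available differ per model, and the q8_0 tag below is illustrative, so browse ollama.com/library to confirm what a given model offers:

# Pull a specific quantization variant instead of the default
# (tag names vary by model -- check ollama.com/library)
ollama pull qwen2.5:7b-instruct-q8_0

# Inspect a local model, including the quantization it uses
ollama show qwen2.5:7b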