What you'll learn
- Setting up Ollama: installation, model pulling, and first inference
- Model selection: which models for which tasks
- Quantization: trading precision for speed and memory savings
- Performance tuning: getting the most from your hardware
What Is Ollama?
Ollama is a tool that runs large language models on your local machine. Think of it as Docker for AI models -- it downloads, manages, and serves models through a simple API that any application can call.
You install it once, pull the models you want, and start making AI requests -- all without an internet connection after the initial download. The models run entirely on your CPU or GPU, your data never leaves your machine, and there are no per-request costs.
Ollama exposes a REST API on localhost:11434 that is compatible with the OpenAI API format. This means any tool built for OpenAI or Claude can be pointed at Ollama with minimal code changes. Your existing integrations just work.
Getting Started
Setup takes less than 5 minutes:
# Install Ollama (macOS)
brew install ollama
# Or download from ollama.com for any platform
# Linux: curl -fsSL https://ollama.com/install.sh | sh
# Start the Ollama server
ollama serve
# Pull your first model (in a new terminal)
ollama pull qwen2.5:7b # 4.4 GB, excellent for general tasks
ollama pull deepseek-coder-v2 # Specialized for code
ollama pull llama3.1:8b # Meta's versatile model
# Test it immediately
ollama run qwen2.5:7b "Explain what sovereignty means for AI in 3 sentences."That is it. You now have a local AI that responds to natural language, writes code, summarizes documents, and answers questions -- all without an internet connection or API key.
Model Selection Guide
Not all models are created equal. Here is how to choose the right model for each task:
General assistant (Qwen 2.5 7B): The best all-around small model. Excellent at following instructions, summarizing, drafting, and Q&A. Runs well on 8GB RAM. Your default model for 80% of tasks.
Code generation (DeepSeek Coder V2): Specialized for writing and debugging code. Understands dozens of programming languages. Better at code than general models twice its size. Use this for development tasks.
Complex reasoning (Llama 3.1 70B): Meta's largest open model. Approaches cloud model quality for analysis, planning, and nuanced writing. Requires 40GB+ RAM. Use when you need frontier-quality reasoning without the cloud.
Embeddings (nomic-embed-text): Converts text into vector embeddings for search and retrieval. Fast, small, and purpose-built. Essential for building your local RAG pipeline.
# Model sizes and RAM requirements
# Model Size RAM Best For
# qwen2.5:3b 2 GB 4 GB Quick tasks, low-end hardware
# qwen2.5:7b 4 GB 8 GB General assistant (recommended start)
# llama3.1:8b 5 GB 8 GB Versatile, strong reasoning
# deepseek-coder 4 GB 8 GB Code generation
# qwen2.5:14b 9 GB 16 GB Better quality, more RAM
# llama3.1:70b 40 GB 48 GB Near-frontier quality
# nomic-embed 274 MB 1 GB Embeddings onlyQuantization: The Quality-Speed Tradeoff
AI models are stored as numbers (weights). Full-precision weights use 16 bits per number. Quantization reduces this to 8 bits, 4 bits, or even 2 bits -- making the model smaller, faster, and able to run on less RAM.
Think of it like image compression. A full-quality photo is 10MB. A compressed version is 2MB. You lose some detail, but for most purposes it looks the same. Quantization works the same way for AI models.
Q4_K_M (4-bit, recommended): The sweet spot. Models are roughly 4x smaller than full precision. Quality loss is minimal for most tasks. This is what Ollama uses by default.
Q8_0 (8-bit): Higher quality, but models are 2x larger. Use when you have enough RAM and quality matters (long-form writing, complex analysis).
Q2_K (2-bit): Maximum compression. Models are tiny but quality degrades noticeably. Use only when RAM is severely constrained and you need something running.
Performance Tuning
Getting the best performance from local models means understanding your hardware and configuring Ollama to match:
GPU acceleration. If you have an Apple Silicon Mac (M1/M2/M3/M4), Ollama automatically uses the GPU. This is 5-10x faster than CPU-only. On Linux/Windows, NVIDIA GPUs with CUDA support provide similar acceleration. Check with ollama ps to verify GPU is active.
Context window. By default, Ollama uses a 2048-token context window. For longer documents, increase it: ollama run qwen2.5:7b --num-ctx 8192. Larger context uses more RAM but lets the model process longer inputs.
Concurrent requests. Ollama handles one request at a time by default. For multiple simultaneous users or agents, set OLLAMA_NUM_PARALLEL=4 to allow parallel processing. Each parallel request uses additional RAM.
Keep alive. After a request, Ollama keeps the model in memory for 5 minutes by default. Adjust with OLLAMA_KEEP_ALIVE=30m to keep it loaded longer (faster subsequent requests) or 0 to unload immediately (save RAM).
Calling Ollama from Code
Ollama exposes an API on localhost:11434. You can call it from any language:
// JavaScript/Node.js -- direct API call
const response = await fetch('http://localhost:11434/api/generate', {
method: 'POST',
body: JSON.stringify({
model: 'qwen2.5:7b',
prompt: 'Summarize the benefits of local AI in 3 bullet points.',
stream: false // Set true for streaming responses
})
});
const data = await response.json();
console.log(data.response);
// Python -- using the ollama library
import ollama
response = ollama.generate(
model='qwen2.5:7b',
prompt='Summarize the benefits of local AI in 3 bullet points.'
)
print(response['response'])
# Shell -- simple curl
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5:7b",
"prompt": "Summarize the benefits of local AI.",
"stream": false
}'Local AI Mistakes
Running models too large for your RAM. A 70B model on a 16GB machine will thrash to swap, taking minutes per response. Check RAM requirements before pulling a model. Use ollama ps to see memory usage.
Expecting cloud quality from small models. A 7B parameter model is not Claude or GPT-4. It is excellent for routine tasks but struggles with complex multi-step reasoning. Use local for routine, cloud for complex. That is the hybrid strategy.
Not monitoring resource usage. Local models consume significant CPU, GPU, and RAM. Running a model while doing other intensive work (video editing, compiling) can freeze your machine. Monitor with Activity Monitor or htop.
Try It Yourself
Set up your local AI stack:
1. Install Ollama (brew install ollama or ollama.com)
2. Pull qwen2.5:7b (your general assistant)
3. Run a test: ollama run qwen2.5:7b "What can you help me with?"
4. Pull nomic-embed-text (for embeddings -- needed later)
5. Test the API: curl localhost:11434/api/generate -d '{"model":"qwen2.5:7b","prompt":"Hello","stream":false}'
6. Benchmark: time how long a 200-word response takes
- Under 5 seconds on GPU = excellent
- Under 15 seconds on CPU = acceptable
You now have a zero-cost AI running on your own hardware.Key concepts.
Local AI: Ollama & Open Models
What Is Ollama?
Model Selection Strategy
Quantization
GPU Acceleration
The Hybrid Strategy
RAM Rule
Local AI quiz.
Local AI: Ollama & Open Models
1What is quantization and why is it important for local AI?
2What is the recommended model selection strategy for a sovereign stack?
3Why should you check RAM before pulling a model?