Local AI: Ollama & Open Models

What you'll learn

Setting up Ollama: installation, model pulling, and first inference
Model selection: which models for which tasks
Quantization: trading precision for speed and memory savings
Performance tuning: getting the most from your hardware

Foundation

What Is Ollama?

01ConceptUnderstand the core idea

→

02ApplySee it in practice

→

03BuildUse it in your projects

Master what is ollama? step by step.

Ollama is a tool that runs large language models on your local machine. Think of it as Docker for AI models -- it downloads, manages, and serves models through a simple API that any application can call.

You install it once, pull the models you want, and start making AI requests -- all without an internet connection after the initial download. The models run entirely on your CPU or GPU, your data never leaves your machine, and there are no per-request costs.

Ollama exposes a REST API on localhost:11434 that is compatible with the OpenAI API format. This means any tool built for OpenAI or Claude can be pointed at Ollama with minimal code changes. Your existing integrations just work.

Implementation

Getting Started

Setup takes less than 5 minutes:

# Install Ollama (macOS)
brew install ollama

# Or download from ollama.com for any platform
# Linux: curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama server
ollama serve

# Pull your first model (in a new terminal)
ollama pull qwen2.5:7b          # 4.4 GB, excellent for general tasks
ollama pull deepseek-coder-v2   # Specialized for code
ollama pull llama3.1:8b         # Meta's versatile model

# Test it immediately
ollama run qwen2.5:7b "Explain what sovereignty means for AI in 3 sentences."

That is it. You now have a local AI that responds to natural language, writes code, summarizes documents, and answers questions -- all without an internet connection or API key.

Strategy

Model Selection Guide

Not all models are created equal. Here is how to choose the right model for each task:

General assistant (Qwen 2.5 7B): The best all-around small model. Excellent at following instructions, summarizing, drafting, and Q&A. Runs well on 8GB RAM. Your default model for 80% of tasks.

Code generation (DeepSeek Coder V2): Specialized for writing and debugging code. Understands dozens of programming languages. Better at code than general models twice its size. Use this for development tasks.

Complex reasoning (Llama 3.1 70B): Meta's largest open model. Approaches cloud model quality for analysis, planning, and nuanced writing. Requires 40GB+ RAM. Use when you need frontier-quality reasoning without the cloud.

Embeddings (nomic-embed-text): Converts text into vector embeddings for search and retrieval. Fast, small, and purpose-built. Essential for building your local RAG pipeline.

# Model sizes and RAM requirements
# Model           Size    RAM     Best For
# qwen2.5:3b      2 GB    4 GB    Quick tasks, low-end hardware
# qwen2.5:7b      4 GB    8 GB    General assistant (recommended start)
# llama3.1:8b     5 GB    8 GB    Versatile, strong reasoning
# deepseek-coder  4 GB    8 GB    Code generation
# qwen2.5:14b     9 GB   16 GB    Better quality, more RAM
# llama3.1:70b   40 GB   48 GB    Near-frontier quality
# nomic-embed     274 MB   1 GB    Embeddings only

Concept

Quantization: The Quality-Speed Tradeoff

AI models are stored as numbers (weights). Full-precision weights use 16 bits per number. Quantization reduces this to 8 bits, 4 bits, or even 2 bits -- making the model smaller, faster, and able to run on less RAM.

Think of it like image compression. A full-quality photo is 10MB. A compressed version is 2MB. You lose some detail, but for most purposes it looks the same. Quantization works the same way for AI models.

Q4_K_M (4-bit, recommended): The sweet spot. Models are roughly 4x smaller than full precision. Quality loss is minimal for most tasks. This is what Ollama uses by default.

Q8_0 (8-bit): Higher quality, but models are 2x larger. Use when you have enough RAM and quality matters (long-form writing, complex analysis).

Q2_K (2-bit): Maximum compression. Models are tiny but quality degrades noticeably. Use only when RAM is severely constrained and you need something running.

Advanced

Performance Tuning

Getting the best performance from local models means understanding your hardware and configuring Ollama to match:

GPU acceleration. If you have an Apple Silicon Mac (M1/M2/M3/M4), Ollama automatically uses the GPU. This is 5-10x faster than CPU-only. On Linux/Windows, NVIDIA GPUs with CUDA support provide similar acceleration. Check with ollama ps to verify GPU is active.

Context window. By default, Ollama uses a 2048-token context window. For longer documents, increase it: ollama run qwen2.5:7b --num-ctx 8192. Larger context uses more RAM but lets the model process longer inputs.

Concurrent requests. Ollama handles one request at a time by default. For multiple simultaneous users or agents, set OLLAMA_NUM_PARALLEL=4 to allow parallel processing. Each parallel request uses additional RAM.

Keep alive. After a request, Ollama keeps the model in memory for 5 minutes by default. Adjust with OLLAMA_KEEP_ALIVE=30m to keep it loaded longer (faster subsequent requests) or 0 to unload immediately (save RAM).

Integration

Calling Ollama from Code

Ollama exposes an API on localhost:11434. You can call it from any language:

// JavaScript/Node.js -- direct API call
const response = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'qwen2.5:7b',
    prompt: 'Summarize the benefits of local AI in 3 bullet points.',
    stream: false          // Set true for streaming responses
  })
});
const data = await response.json();
console.log(data.response);

// Python -- using the ollama library
import ollama
response = ollama.generate(
  model='qwen2.5:7b',
  prompt='Summarize the benefits of local AI in 3 bullet points.'
)
print(response['response'])

# Shell -- simple curl
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Summarize the benefits of local AI.",
  "stream": false
}'

Anti-Patterns

Local AI Mistakes

Running models too large for your RAM. A 70B model on a 16GB machine will thrash to swap, taking minutes per response. Check RAM requirements before pulling a model. Use ollama ps to see memory usage.

Expecting cloud quality from small models. A 7B parameter model is not Claude or GPT-4. It is excellent for routine tasks but struggles with complex multi-step reasoning. Use local for routine, cloud for complex. That is the hybrid strategy.

Not monitoring resource usage. Local models consume significant CPU, GPU, and RAM. Running a model while doing other intensive work (video editing, compiling) can freeze your machine. Monitor with Activity Monitor or htop.

Try It Yourself

Set up your local AI stack:

1. Install Ollama (brew install ollama or ollama.com)
2. Pull qwen2.5:7b (your general assistant)
3. Run a test: ollama run qwen2.5:7b "What can you help me with?"
4. Pull nomic-embed-text (for embeddings -- needed later)
5. Test the API: curl localhost:11434/api/generate -d '{"model":"qwen2.5:7b","prompt":"Hello","stream":false}'
6. Benchmark: time how long a 200-word response takes
   - Under 5 seconds on GPU = excellent
   - Under 15 seconds on CPU = acceptable

You now have a zero-cost AI running on your own hardware.

Review

Key concepts.

What Is Ollama?

A tool that runs LLMs on your local machine. Downloads, manages, and serves models through a REST API on localhost:11434. No internet needed after initial download. No per-request costs.

Model Selection Strategy

Qwen 2.5 7B for general tasks (8GB RAM). DeepSeek Coder for code. Llama 3.1 70B for complex reasoning (48GB RAM). Nomic-embed-text for embeddings. Match model to task and hardware.

Quantization

Compressing model weights from 16-bit to 4-bit (or less). Q4_K_M is the sweet spot -- 4x smaller with minimal quality loss. Q8_0 for higher quality. Q2_K for extreme compression.

GPU Acceleration

Apple Silicon Macs automatically use GPU (5-10x faster). NVIDIA GPUs with CUDA on Linux/Windows. Check with ollama ps to verify GPU is active.

The Hybrid Strategy

Run routine tasks locally (free). Use cloud APIs for complex reasoning (paid). 80/20 split saves 90%+ on AI costs while maintaining access to frontier capabilities.

RAM Rule

Never run a model that exceeds your available RAM. A 70B model needs 48GB+. A 7B model needs 8GB+. Running too-large models causes swap thrashing and minutes-per-response performance.

Check Your Understanding

Local AI quiz.