Ollama: Your Local AI Lab

After this lesson you'll know how to:

  • Install and configure Ollama on macOS, Linux, and Windows
  • Pull, run, and manage models from the command line
  • Use the Ollama API for programmatic access
  • Use the essential Ollama commands for day-to-day work

What Is Ollama?

Ollama is the Docker of local AI. It packages large language models into a simple command-line interface -- pull a model, run it, done. No Python environments, no dependency hell, no CUDA driver nightmares. It handles model downloading, quantization selection, memory management, and GPU acceleration automatically.

Ollama supports hundreds of open-source models: Llama 3.1, Mistral, Gemma 2, Qwen 2.5, DeepSeek, Phi-3, and more. It runs on macOS (Apple Silicon and Intel), Linux, and Windows. It exposes a local API on port 11434 that any application can connect to -- making it the foundation for everything we build in this course.

Installation

macOS:

curl -fsSL https://ollama.com/install.sh | sh

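Or download the .dmg from ollama.com, which is the safer choice if your version of the install script reports that it supports Linux only. Both methods install the CLI and the background service. Apple Silicon Macs get automatic GPU acceleration through Metal.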

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Supports Ubuntu 20.04+, Debian 11+, Fedora 36+, and most modern distributions. NVIDIA GPU acceleration requires CUDA drivers (installed separately).

Windows:

Download the installer from ollama.com. Requires Windows 10 or later. NVIDIA GPU support included. AMD GPU support is in preview.

Verify installation:

ollama --version

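You should see the version number. If the command is not found, open a new terminal so your PATH is refreshed. If it warns that it cannot connect, make sure the Ollama service is running.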

First-time setup: Ollama starts a background service automatically. On macOS, you'll see an Ollama icon in your menu bar. On Linux, it runs as a systemd service. The service must be running before you can pull or run models.
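
To check from code that the service is up, you can request the API's base URL and see whether anything answers. The following is a minimal sketch that assumes the default address used in this lesson; the function name ollama_is_running is a hypothetical helper, not part of Ollama.

```python
import urllib.error
import urllib.request

DEFAULT_BASE_URL = "http://localhost:11434"  # default address from this lesson


def ollama_is_running(base_url=DEFAULT_BASE_URL, timeout=2.0):
    """Return True if an HTTP 200 comes back from the Ollama base URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as "not running".
        return False
```

Calling ollama_is_running() before pulling or running a model gives a clearer failure than a connection error halfway through a script.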

Your First Model

Pull and run a model in two commands:

ollama pull llama3.1:8b
ollama run llama3.1:8b

The first command downloads the model (roughly 5GB for the default quantized 8B build). The second launches an interactive chat session: type a prompt and get a response, with no API key required.

Recommended starter models by hardware:

  • 8GB RAM: llama3.1:8b or gemma2:2b (fast, lightweight)
  • 16GB RAM: qwen2.5:14b or mistral:7b (good balance)
  • 32GB RAM: qwen2.5:32b or deepseek-r1:32b (strong reasoning)
  • 64GB+ RAM: llama3.1:70b or qwen2.5:72b (near-frontier quality)
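
The table above can be turned into a small lookup. This sketch takes the first model listed for each tier; the helper name pick_starter_model and the fallback to gemma2:2b below 8GB are assumptions made for illustration, not recommendations from Ollama.

```python
# (minimum RAM in GB, first model listed for that tier in the table above)
STARTER_MODELS = [
    (64, "llama3.1:70b"),
    (32, "qwen2.5:32b"),
    (16, "qwen2.5:14b"),
    (8, "llama3.1:8b"),
]
FALLBACK_MODEL = "gemma2:2b"  # assumption: smallest model named in this lesson


def pick_starter_model(ram_gb):
    """Return the model for the largest tier whose RAM requirement is met."""
    for min_ram, model in STARTER_MODELS:
        if ram_gb >= min_ram:
            return model
    return FALLBACK_MODEL


print(pick_starter_model(16))  # qwen2.5:14b
```

Treat the result as a starting point: free memory, other running applications, and GPU memory all affect what actually fits.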

Quick Test Prompts

Once your model is running, try these to verify it works:

>>> Explain quantum computing in 3 sentences.
>>> Write a Python function that reverses a string.
>>> Summarize the key differences between TCP and UDP.

If you get coherent responses, your local AI lab is operational.
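
The same check can be scripted through the HTTP API covered later in this lesson. This is a rough sketch: it only confirms that each prompt returns non-empty text, since judging coherence still needs a human reader. The names generate and smoke_test are hypothetical helpers, and the code assumes the default address and the llama3.1:8b model pulled earlier.

```python
import json
import urllib.request

TEST_PROMPTS = [
    "Explain quantum computing in 3 sentences.",
    "Write a Python function that reverses a string.",
    "Summarize the key differences between TCP and UDP.",
]


def generate(prompt, model="llama3.1:8b", base_url="http://localhost:11434"):
    """Send one prompt to /api/generate and return the response text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        base_url + "/api/generate",
        data=payload.encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


def smoke_test(ask=generate, prompts=TEST_PROMPTS):
    """Return {prompt: bool}, True when the reply contains non-empty text."""
    return {p: bool(ask(p).strip()) for p in prompts}
```

With the service running, all(smoke_test().values()) should be True. Passing a different ask function lets you reuse the check against another model or a stub.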

Essential Commands

These are the commands you'll use daily:

# List all downloaded models
ollama list

# Pull a specific model
ollama pull mistral:7b

# Run a model interactively
ollama run llama3.1:8b

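# Run a single prompt and exit (the quoted text is sent as the user prompt;
# to set a system prompt, type /set system <text> inside an interactive session)
ollama run llama3.1:8b "Write a Python function that reverses a string."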

# Show model details (size, parameters, license)
ollama show llama3.1:8b

# Remove a model to free disk space
ollama rm gemma2:2b

# List running models
ollama ps

# Copy a model (for creating custom variants)
ollama cp llama3.1:8b my-custom-model
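
The commands above can also be driven from a script. The sketch below runs ollama list and extracts the model names, assuming the output is a header row followed by one whitespace-separated row per model with the name in the first column. The helper names are hypothetical.

```python
import subprocess


def parse_model_names(list_output):
    """Extract model names from `ollama list` output (first column, header skipped)."""
    lines = [ln for ln in list_output.splitlines() if ln.strip()]
    return [ln.split()[0] for ln in lines[1:]]


def installed_models():
    """Run `ollama list` and return the model names it reports."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    )
    return parse_model_names(result.stdout)
```

A script can then check "llama3.1:8b" in installed_models() and only call ollama pull when the model is missing.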

Multiline input: In the interactive session, use triple quotes for long prompts:

>>> """
Analyze the following code for security vulnerabilities:
[paste code here]
"""

The Ollama API

Ollama exposes a REST API on localhost:11434 that lets any application use your local models. This is what makes Ollama a platform, not just a chat tool.

# Generate a completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "What is the capital of France?",
  "stream": false
}'

# Chat format (multi-turn conversation)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain Docker in one paragraph."}
  ],
  "stream": false
}'
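
The chat call above translates directly to Python's standard library. This is a minimal sketch, assuming the default address and a non-streaming request; build_chat_request and chat are hypothetical helper names, and the code reads the reply from the message.content field of the JSON response.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_chat_request(model, messages, base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for /api/chat."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        base_url + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, messages, base_url=OLLAMA_URL):
    """Send the conversation and return the assistant's reply text."""
    req = build_chat_request(model, messages, base_url)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

To continue a conversation, append the returned text as an {"role": "assistant", ...} message, add the next user message, and call chat again with the full list.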

# Generate embeddings (for search/RAG)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "This is a document to embed."
}'
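
Embeddings become useful once you compare them. The sketch below requests vectors from /api/embed and scores two of them with cosine similarity, which is the basic operation behind search and RAG. It assumes the response carries the vectors in an "embeddings" list, one per input; embed and cosine_similarity are hypothetical helper names.

```python
import json
import math
import urllib.request


def embed(texts, model="nomic-embed-text", base_url="http://localhost:11434"):
    """Return one embedding vector per input string."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/api/embed",
        data=payload,  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For example, embedding a query and a handful of documents with embed([...]) and ranking the documents by cosine_similarity against the query vector gives a simple semantic search.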

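Alongside its native /api endpoints, Ollama serves an OpenAI-compatible API under /v1, so tools built for OpenAI's API often work with Ollama once you change the base URL to http://localhost:11434/v1. Clients that insist on an API key will accept any placeholder value, since Ollama does not check it.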
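
To see what "changing the base URL" means in practice, here is a sketch that builds an OpenAI-style chat completion request by hand with the standard library. It assumes the /v1/chat/completions path and the choices[0].message.content response shape used by OpenAI's chat API; the helper names are hypothetical, and the Authorization value is a placeholder.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # the only Ollama-specific setting


def build_openai_chat_request(model, user_text, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder key
        },
        method="POST",
    )


def extract_reply(body):
    """Pull the assistant text out of an OpenAI-style response body."""
    return body["choices"][0]["message"]["content"]
```

Sending the request with urllib.request.urlopen and passing the decoded JSON to extract_reply returns the model's answer. An OpenAI client library would do the same work once its base URL points at the address above.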

Security note: By default, Ollama only listens on localhost. If you need network access (e.g., from another machine), set the environment variable OLLAMA_HOST=0.0.0.0. Only do this on trusted networks -- there's no authentication built in.
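
Before starting the service with a custom OLLAMA_HOST, it can help to check whether the value binds beyond the loopback interface. This is a rough sketch that assumes the variable holds a host, a host:port pair, or a full URL; is_network_exposed is a hypothetical helper, and it errs on the side of reporting exposure for any hostname it cannot classify.

```python
import ipaddress
from urllib.parse import urlsplit


def is_network_exposed(ollama_host):
    """Return True if the OLLAMA_HOST value binds beyond the loopback interface."""
    value = ollama_host.strip()
    if not value:
        return False  # unset: the lesson's default is localhost only
    if "://" not in value:
        value = "http://" + value  # add a scheme so urlsplit can find the host
    hostname = urlsplit(value).hostname or ""
    if hostname == "localhost":
        return False
    try:
        return not ipaddress.ip_address(hostname).is_loopback
    except ValueError:
        # Not an IP address: assume the name is reachable from the network.
        return True


print(is_network_exposed("0.0.0.0"))          # True
print(is_network_exposed("127.0.0.1:11434"))  # False
```

A startup script could call is_network_exposed(os.environ.get("OLLAMA_HOST", "")) and print a warning when it returns True.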

Quiz

1. What port does the Ollama API run on by default?

2. What is the recommended model for a machine with 16GB RAM?

Vocabulary

What command installs Ollama on macOS or Linux?
curl -fsSL https://ollama.com/install.sh | sh
How do you pull and run a model in Ollama?
ollama pull llama3.1:8b (downloads model) then ollama run llama3.1:8b (starts interactive chat)
How do you list all downloaded models?
ollama list
What makes Ollama's API compatible with existing tools?
Ollama serves OpenAI-compatible endpoints under /v1 -- many tools built for OpenAI's API work by changing the base URL to http://localhost:11434/v1
What is the security consideration when exposing Ollama to the network?
Ollama has no built-in authentication. Setting OLLAMA_HOST=0.0.0.0 exposes it to the network -- only do this on trusted networks.
How do you generate embeddings with Ollama?
Use the /api/embed endpoint with an embedding model like nomic-embed-text