
Local AI: Ollama & Open Models

Your own AI, running on your own hardware, with zero API costs.

Ollama turns your laptop into an AI server. Open-weight models like Qwen, DeepSeek, and Llama run locally with no internet connection, no API keys, and no per-token charges. This lesson gets you from installation to production-quality local AI.

What you'll learn

  • Setting up Ollama: installation, model pulling, and first inference
  • Model selection: which models for which tasks
  • Quantization: trading precision for speed and memory savings
  • Performance tuning: getting the most from your hardware

What Is Ollama?

Ollama is a tool that runs large language models on your local machine. Think of it as Docker for AI models -- it downloads, manages, and serves models through a simple API that any application can call.

You install it once, pull the models you want, and start making AI requests -- all without an internet connection after the initial download. The models run entirely on your CPU or GPU, your data never leaves your machine, and there are no per-request costs.

Ollama exposes a REST API on localhost:11434, including an endpoint that is compatible with the OpenAI API format. This means most tools built for the OpenAI API can be pointed at Ollama with minimal code changes -- often just a base-URL swap -- and your existing integrations keep working.
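As a minimal sketch, assuming `ollama serve` is running on the default port and `qwen2.5:7b` has already been pulled, a chat request against the OpenAI-compatible endpoint looks like this:

```shell
# Chat completion against Ollama's OpenAI-compatible endpoint.
# Assumes `ollama serve` is running and qwen2.5:7b is already pulled.
PAYLOAD='{
  "model": "qwen2.5:7b",
  "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
}'

curl -sf http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "Ollama server not reachable on localhost:11434"
```

The same idea applies to SDKs: point an OpenAI client at the base URL `http://localhost:11434/v1` (Ollama ignores the API key, but most clients require a non-empty one).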

Getting Started

Setup takes less than 5 minutes:

# Install Ollama (macOS)
brew install ollama

# Or download from ollama.com for any platform
# Linux: curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama server
ollama serve

# Pull your first model (in a new terminal)
ollama pull qwen2.5:7b          # 4.4 GB, excellent for general tasks
ollama pull deepseek-coder-v2   # Specialized for code
ollama pull llama3.1:8b         # Meta's versatile model

# Test it immediately
ollama run qwen2.5:7b "Explain what sovereignty means for AI in 3 sentences."

That is it. You now have a local AI that responds to natural language, writes code, summarizes documents, and answers questions -- all without an internet connection or API key.

Model Selection Guide

Not all models are created equal. Here is how to choose the right model for each task:

General assistant (Qwen 2.5 7B): The best all-around small model. Excellent at following instructions, summarizing, drafting, and Q&A. Runs well on 8GB RAM. Your default model for 80% of tasks.

Code generation (DeepSeek Coder V2): Specialized for writing and debugging code. Understands dozens of programming languages. Better at code than general models twice its size. Use this for development tasks.

Complex reasoning (Llama 3.1 70B): One of Meta's largest open models. Approaches cloud-model quality for analysis, planning, and nuanced writing. Requires 40GB+ RAM. Use when you need near-frontier reasoning without the cloud.

Embeddings (nomic-embed-text): Converts text into vector embeddings for search and retrieval. Fast, small, and purpose-built. Essential for building your local RAG pipeline.
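A minimal sketch of generating an embedding, assuming `nomic-embed-text` has been pulled and `ollama serve` is running (the `/api/embed` endpoint and its `input` field are part of Ollama's native API; older versions use `/api/embeddings` with a `prompt` field instead):

```shell
# Request a vector embedding from a local embedding model.
# Assumes `ollama serve` is running and nomic-embed-text is pulled.
PAYLOAD='{"model": "nomic-embed-text", "input": "Sovereignty means owning your own stack."}'

curl -sf http://localhost:11434/api/embed \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "Ollama server not reachable on localhost:11434"
```

The response contains a list of floating-point vectors you can store in any vector database for retrieval.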

# Model sizes and RAM requirements
# Model            Size     RAM     Best For
# qwen2.5:3b       2 GB     4 GB    Quick tasks, low-end hardware
# qwen2.5:7b       4 GB     8 GB    General assistant (recommended start)
# llama3.1:8b      5 GB     8 GB    Versatile, strong reasoning
# deepseek-coder   4 GB     8 GB    Code generation
# qwen2.5:14b      9 GB     16 GB   Better quality, more RAM
# llama3.1:70b     40 GB    48 GB   Near-frontier quality
# nomic-embed      274 MB   1 GB    Embeddings only

Quantization: The Quality-Speed Tradeoff

AI models are stored as numbers (weights). Full-precision weights use 16 bits per number. Quantization reduces this to 8 bits, 4 bits, or even 2 bits -- making the model smaller, faster, and able to run on less RAM.

Think of it like image compression. A full-quality photo is 10MB. A compressed version is 2MB. You lose some detail, but for most purposes it looks the same. Quantization works the same way for AI models.
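The memory savings follow directly from the bit widths. Here is a rough back-of-envelope for a 7B-parameter model; real files run slightly larger because some layers are kept at higher precision:

```shell
# Approximate model size: parameters x bits-per-weight / 8 = bytes
awk 'BEGIN {
  p = 7e9                                       # 7 billion parameters
  printf "fp16: %.1f GB\n", p * 16 / 8 / 1e9    # full precision
  printf "q8:   %.1f GB\n", p *  8 / 8 / 1e9    # 8-bit quantized
  printf "q4:   %.1f GB\n", p *  4 / 8 / 1e9    # 4-bit quantized
}'
```

This is why the 4-bit qwen2.5:7b download is about 4.4 GB rather than the 14 GB a full-precision copy would need.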

Q4_K_M (4-bit, recommended): The sweet spot. Models are roughly 4x smaller than full precision. Quality loss is minimal for most tasks. This is what Ollama uses by default.

Q8_0 (8-bit): Higher quality, but models are 2x larger. Use when you have enough RAM and quality matters (long-form writing, complex analysis).

Q2_K (2-bit): Maximum compression. Models are tiny but quality degrades noticeably. Use only when RAM is severely constrained and you need something running.
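Ollama exposes quantization variants as model tags. The specific tag names below follow the registry's naming convention but vary by model, so treat them as illustrative rather than exact:

```shell
# A plain pull resolves to the default 4-bit (Q4_K_M) build
ollama pull qwen2.5:7b

# An explicit 8-bit build for higher quality (roughly 2x the RAM)
ollama pull qwen2.5:7b-instruct-q8_0

# See which models and sizes you have locally
ollama list
```

Check a model's page on ollama.com for the full list of available quantization tags before pulling.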
