Choosing the Right Model

Lesson Content

After this lesson you'll know

  • How model size (parameters) relates to quality and resource usage
  • What quantization is and which level to choose
  • Which models excel at specific tasks (coding, reasoning, writing, chat)
  • How to benchmark and compare models on your own hardware

The Parameter-Quality Spectrum


Model parameters are the learned weights from training. More parameters generally means more capability -- but also more RAM, more disk space, and slower inference. The art of local AI is finding the smallest model that handles your task well.

Size tiers in practice:

  • 1-3B parameters: Fast, lightweight. Good for simple classification, extraction, and basic Q&A. Think of these as smart autocomplete.
  • 7-8B parameters: The local AI workhorse. Handles most writing, coding, and analysis tasks. Quality comparable to GPT-3.5 for straightforward work.
  • 14-32B parameters: The sweet spot for serious local work. Strong reasoning, nuanced writing, complex code generation. This is where local starts competing with cloud.
  • 70B+ parameters: Near-frontier quality. Requires significant hardware (64GB+ RAM or high-end GPU). Worth it for work that demands deep reasoning or long-context analysis.
Rule of thumb: At Q4, budget roughly 0.6GB of RAM per 1B parameters. An 8B model needs ~5GB, a 32B model needs ~20GB, and a 70B model needs ~40GB. Longer context windows add to this, so always leave headroom for your OS and other applications.
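The sizing examples above work out to about 0.6GB per billion parameters at Q4 (8B needs ~5GB). A minimal sketch of that arithmetic follows; the function names and the 4GB headroom default are illustrative assumptions, not measured values.

```python
def estimate_ram_gb(params_billion: float, gb_per_billion: float = 0.6) -> float:
    """Approximate RAM needed to load a Q4-quantized model (rule of thumb only)."""
    return params_billion * gb_per_billion


def fits(params_billion: float, total_ram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if the model fits while leaving headroom for the OS and other apps.

    headroom_gb is an assumed placeholder; adjust it for your own machine.
    """
    return estimate_ram_gb(params_billion) <= total_ram_gb - headroom_gb


for size in (8, 32, 70):
    print(f"{size}B at Q4: ~{estimate_ram_gb(size):.0f} GB")
```

Treat the result as a first filter before pulling a model, not a guarantee. Actual usage depends on the exact quantization format and context length.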

Understanding Quantization

Most models are released with 16 bits per parameter (FP16 or BF16). Quantization compresses these weights to fewer bits, cutting size and RAM usage substantially with only a small quality loss at moderate levels. This is what makes large models runnable on consumer hardware.

Quantization levels:

  • Q8 (8-bit): Minimal quality loss. ~50% size reduction from full precision. Use when quality is paramount and you have the RAM.
  • Q5: Barely perceptible quality loss. Good balance for most users.
  • Q4 (4-bit): The default for most Ollama models. ~75% size reduction. Slight quality degradation but excellent for daily use. This is what you should start with.
  • Q3 and below: Noticeable quality degradation. Only use when you absolutely must fit a larger model into limited RAM.
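The size reductions in the list follow from simple bit arithmetic, sketched below using nominal bit widths. This is an approximation: real quantized files are somewhat larger because they also store scaling metadata alongside the weights. The helper names are illustrative.

```python
FP16_BITS = 16  # baseline: 16 bits per parameter


def size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight size: parameters x bits, converted to gigabytes."""
    return params_billion * bits_per_weight / 8


def reduction_vs_fp16(bits_per_weight: float) -> float:
    """Fraction of the 16-bit size saved at the given bit width."""
    return 1 - bits_per_weight / FP16_BITS


for name, bits in (("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4), ("Q3", 3)):
    print(
        f"{name}: 8B model ~{size_gb(8, bits):.1f} GB, "
        f"{reduction_vs_fp16(bits):.0%} smaller than FP16"
    )
```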

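In Ollama, quantization is indicated by the model tag. For example, llama3.1:8b defaults to a Q4 variant, while a tag such as llama3.1:8b-instruct-q8_0 selects Q8. Exact tag names vary by model, so check the model's page in the Ollama library, and run ollama show <model-name> to confirm the quantization of a model you have already pulled.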
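The same check can be scripted against Ollama's local HTTP API. The sketch below assumes the server is on its default address and that the /api/show response carries a details.quantization_level field, as in the API documentation at the time of writing. Older versions may expect "name" instead of "model" in the request body, so verify against your installed version.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def show_model(model: str) -> dict:
    """POST to /api/show and return the parsed JSON response."""
    body = json.dumps({"model": model}).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/show",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def quantization_of(show_response: dict) -> str:
    """Pull the quantization level out of a /api/show response, if present."""
    return show_response.get("details", {}).get("quantization_level", "unknown")


# Example (requires a running Ollama server and a pulled model):
# print(quantization_of(show_model("llama3.1:8b")))
```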

The counterintuitive truth: A well-quantized larger model often outperforms a smaller model at full precision. A 32B model at Q4 typically beats a 14B model at Q8. When choosing between a bigger model with more compression or a smaller model with less compression, go bigger.
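As a sketch of the "go bigger" rule: among the options that fit a RAM budget, prefer the one with the most parameters. Sizes here use nominal bit widths, and the Option type and pick function are hypothetical helpers for illustration. The sketch only ranks the options it is given, so leave out Q3-and-below variants unless you accept their quality loss.

```python
from typing import NamedTuple, Optional


class Option(NamedTuple):
    name: str
    params_billion: float
    bits_per_weight: float

    @property
    def size_gb(self) -> float:
        # Nominal size: parameters x bits, in gigabytes.
        return self.params_billion * self.bits_per_weight / 8


def pick(options: list[Option], ram_budget_gb: float) -> Optional[Option]:
    """Among options that fit the budget, prefer the most parameters."""
    fitting = [o for o in options if o.size_gb <= ram_budget_gb]
    return max(fitting, key=lambda o: o.params_billion, default=None)


options = [Option("32B at Q4", 32, 4), Option("14B at Q8", 14, 8)]
choice = pick(options, ram_budget_gb=20)
print(choice.name if choice else "nothing fits")
```

Note that the two options are close in footprint (a nominal 16GB versus 14GB), which is why the comparison is meaningful at all.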

Model Recommendations by Task

Not all models are created equal. Each model family has strengths:

General writing and chat:

  • Llama 3.1 (8B, 70B) -- Meta's flagship. Strong all-rounder. Excellent instruction following.
  • Qwen 2.5 (7B, 14B, 32B, 72B) -- Alibaba's model. Exceptional multilingual support and reasoning.

Code generation:

  • Qwen 2.5 Coder (7B, 14B, 32B) -- Purpose-built for coding. Excels at Python, JavaScript, TypeScript.
  • DeepSeek Coder V2 (16B) -- Strong at complex code reasoning and debugging.

Reasoning and analysis:

  • DeepSeek-R1 (8B, 32B, 70B distilled variants) -- Chain-of-thought reasoning model. Shows its work. Excellent for math, logic, and complex analysis.
  • QwQ (32B) -- Qwen's reasoning-focused model with strong analytical capabilities.

Embeddings (for RAG/search):

  • nomic-embed-text -- 137M parameters, fast, high-quality embeddings for document search.
  • mxbai-embed-large -- 335M parameters, more accurate for nuanced similarity tasks.

Multi-Model Setup Example

A practical local AI lab might run three chat models plus an embedding model:

ollama pull gemma2:2b          # Fast model for simple tasks
ollama pull qwen2.5:14b        # Daily driver for writing/analysis
ollama pull qwen2.5-coder:14b  # Coding specialist
ollama pull nomic-embed-text   # Embeddings for document search

Total disk space: roughly 20GB (each 14B model is about 9GB at Q4). Switch between them based on the task at hand.
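Switching by task can be as simple as a lookup table. The sketch below maps task labels to the models pulled above; the labels and the fallback choice are illustrative assumptions.

```python
# Task labels are hypothetical; the model tags come from the pull commands above.
MODEL_FOR_TASK = {
    "simple": "gemma2:2b",
    "writing": "qwen2.5:14b",
    "analysis": "qwen2.5:14b",
    "coding": "qwen2.5-coder:14b",
    "embedding": "nomic-embed-text",
}
DEFAULT_MODEL = "qwen2.5:14b"  # fall back to the daily driver


def model_for(task: str) -> str:
    """Return the model tag to use for a task label."""
    return MODEL_FOR_TASK.get(task.lower(), DEFAULT_MODEL)


print(model_for("coding"))
```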

Benchmarking on Your Hardware

Published benchmarks don't tell you how a model performs on your specific machine. Run your own tests:

Speed test:

# Time a generation (--verbose prints timing stats after the response)
ollama run llama3.1:8b --verbose "Write a 200-word essay about climate change."

With --verbose, Ollama prints timing statistics after the response; the "eval rate" line is the generation speed in tokens per second. Aim for 10+ tokens/sec for comfortable interactive use. Below 5 tokens/sec feels sluggish.
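If you call Ollama through its HTTP API instead of the CLI, you can compute the speed yourself. The sketch assumes the /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), as the API documentation describes. The sample numbers are made up, and the "usable" label for the 5-10 range is my own addition.

```python
def tokens_per_second(response: dict) -> float:
    """eval_count tokens generated over eval_duration nanoseconds."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def feel(tps: float) -> str:
    """Classify speed using the lesson's thresholds (10+ good, under 5 sluggish)."""
    if tps >= 10:
        return "comfortable"
    if tps < 5:
        return "sluggish"
    return "usable"  # assumed label for the in-between range


# Made-up sample: 240 tokens generated in 20 seconds.
sample = {"eval_count": 240, "eval_duration": 20_000_000_000}
tps = tokens_per_second(sample)
print(f"{tps:.1f} tokens/sec -> {feel(tps)}")
```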

Quality test: Run the same 5 prompts through different models and compare outputs. Use prompts that match your actual use case:

  1. A writing task (draft an email or report section)
  2. A reasoning task (analyze a problem with multiple variables)
  3. A coding task (write a function with specific requirements)
  4. A summarization task (condense a long document)
  5. An instruction-following task (follow a multi-step prompt precisely)

Rate each output 1-5. The model with the best average across your tasks at an acceptable speed is your daily driver.
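The selection step can be sketched as: average each model's five ratings, drop models below your speed floor, and take the best remaining average. Model names, ratings, and speeds below are placeholder data, not benchmark results.

```python
from statistics import mean
from typing import Optional


def pick_daily_driver(
    ratings: dict[str, list[int]],
    speeds: dict[str, float],
    min_tps: float = 10.0,
) -> Optional[str]:
    """Best average rating among models that meet the speed floor."""
    eligible = {
        model: mean(scores)
        for model, scores in ratings.items()
        if speeds[model] >= min_tps
    }
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# Placeholder data: one 1-5 rating per test prompt, plus measured tokens/sec.
ratings = {
    "model-a": [4, 4, 3, 5, 4],
    "model-b": [5, 4, 4, 5, 5],
    "model-c": [3, 3, 4, 3, 4],
}
speeds = {"model-a": 28.0, "model-b": 6.5, "model-c": 45.0}

print(pick_daily_driver(ratings, speeds))
```

Here model-b has the highest average but misses the speed floor, so it is excluded. Lowering min_tps changes the outcome, which makes the quality-versus-speed trade-off explicit.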

Model Management

Models take disk space. Manage them actively:

# Check disk usage per model
ollama list

# Remove models you don't use
ollama rm model-name

# Keep your daily driver + one specialist + one embedding model
# Delete everything else until you need it

Models can always be re-downloaded. Don't hoard them. Keep your disk clean and pull what you need when you need it. A lean setup with 3-4 models is better than a cluttered one with 20 that you never touch.
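A cleanup pass can also be scripted. This sketch assumes the response shape of Ollama's /api/tags endpoint (a "models" list whose entries have "name" and a "size" in bytes). It only reports candidates and deletes nothing. The keep list and sample data are illustrative.

```python
# Models to keep: daily driver, one specialist, one embedding model (example choice).
KEEP = {"qwen2.5:14b", "qwen2.5-coder:14b", "nomic-embed-text:latest"}


def disk_report(tags_response: dict, keep: set[str]) -> tuple[float, list[str]]:
    """Return total size in GB and the names of models not on the keep list."""
    models = tags_response.get("models", [])
    total_gb = sum(m["size"] for m in models) / 1e9
    removable = sorted(m["name"] for m in models if m["name"] not in keep)
    return total_gb, removable


# Made-up sample in the assumed /api/tags shape.
sample_tags = {
    "models": [
        {"name": "qwen2.5:14b", "size": 9_000_000_000},
        {"name": "gemma2:2b", "size": 1_600_000_000},
    ]
}
total, removable = disk_report(sample_tags, KEEP)
print(f"{total:.1f} GB on disk; candidates for removal: {removable}")
```

Each name in the removable list can then be passed to ollama rm by hand once you have confirmed you no longer need it.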

Quiz

1. What quantization level do most default Ollama models use?

2. When choosing between a larger model with more quantization or a smaller model with less quantization, which typically performs better?

Vocabulary

What is the RAM rule of thumb for quantized models?
Roughly 0.6GB of RAM per 1B parameters at Q4 quantization. An 8B model needs ~5GB, 32B needs ~20GB, 70B needs ~40GB.
What are the four model size tiers and their strengths?
1-3B (smart autocomplete), 7-8B (daily workhorse, GPT-3.5 level), 14-32B (serious work, competes with cloud), 70B+ (near-frontier reasoning)
What is the recommended coding model for local AI?
Qwen 2.5 Coder (available in 7B, 14B, 32B sizes) -- purpose-built for code generation, strong at Python/JS/TS
What tokens/second speed is needed for comfortable interactive use?
10+ tokens/second for comfortable use. Below 5 tokens/second feels sluggish.
What is quantization?
Compressing model weights from 16 bits to fewer bits (8, 5, 4, or 3). Dramatically reduces size and RAM usage with minimal quality loss. Q4 is the standard default.
Which embedding models does this lesson recommend?
nomic-embed-text (137M params, fast) and mxbai-embed-large (335M params, more accurate) for document search and RAG