Choosing the Right Model

Size vs. quality vs. speed -- how to pick the right model for each task without wasting RAM or waiting forever.

After this lesson you'll know

How model size (parameters) relates to quality and resource usage
What quantization is and which level to choose
Which models excel at specific tasks (coding, reasoning, writing, chat)
How to benchmark and compare models on your own hardware

The Parameter-Quality Spectrum

Model parameters are the learned weights from training. More parameters generally means more capability -- but also more RAM, more disk space, and slower inference. The art of local AI is finding the smallest model that handles your task well.

Size tiers in practice:

1-3B parameters: Fast, lightweight. Good for simple classification, extraction, and basic Q&A. Think of these as smart autocomplete.
7-8B parameters: The local AI workhorse. Handles most writing, coding, and analysis tasks. Quality comparable to GPT-3.5 for straightforward work.
14-32B parameters: The sweet spot for serious local work. Strong reasoning, nuanced writing, complex code generation. This is where local starts competing with cloud.
70B+ parameters: Near-frontier quality. Requires significant hardware (64GB+ RAM or high-end GPU). Worth it for work that demands deep reasoning or long-context analysis.

Rule of thumb: A Q4 quantized model needs roughly 0.6GB of RAM per 1B parameters. An 8B model needs ~5GB, a 32B model needs ~20GB, a 70B model needs ~40GB. Always leave headroom for your OS, other applications, and the model's context window.
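The sizes quoted above can be approximated with a few lines of arithmetic. This is a minimal sketch, not an exact formula: the 4.5 bits per weight (a 4-bit format plus per-block scaling data) and the 0.5GB fixed overhead are assumptions chosen for illustration, and estimate_ram_gb is a hypothetical helper name.

```python
def estimate_ram_gb(params_billions: float,
                    bits_per_weight: float = 4.5,   # assumed effective size of a Q4 weight
                    overhead_gb: float = 0.5) -> float:  # assumed fixed runtime overhead
    """Rough RAM estimate in GB for a quantized model's weights plus overhead."""
    # 1B parameters at 8 bits per weight is about 1 GB, so scale by bits / 8.
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb


for size in (8, 32, 70):
    print(f"{size}B model at Q4: ~{estimate_ram_gb(size):.1f} GB")
```

Actual usage also grows with context length, so treat the result as a lower bound when planning headroom.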

Understanding Quantization

Most models are released with 16-bit weights (FP16 or BF16), which this lesson treats as full precision. Quantization compresses these weights to fewer bits, sharply reducing size and RAM usage, usually with only a small quality loss. This is what makes large models runnable on consumer hardware.
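To make the idea concrete, here is a toy sketch of quantization: each weight is mapped to a small integer plus one shared scale factor, then mapped back. It is a simplification for illustration only. Real formats used by local runtimes store scales per block of weights and use more elaborate schemes; the function names and sample weights here are made up.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -0.05, 0.31]  # illustrative values, not from a real model
for bits in (8, 4):
    quantized, scale = quantize(weights, bits)
    restored = dequantize(quantized, scale)
    max_error = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={quantized}, max error={max_error:.4f}")
```

Fewer bits means fewer distinct integer levels, so the rounding error per weight grows. That is the quality cost being traded for a smaller file.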

Quantization levels:

Q8 (8-bit): Minimal quality loss. ~50% size reduction from full precision. Use when quality is paramount and you have the RAM.
Q5: Barely perceptible quality loss. Good balance for most users.
Q4 (4-bit): The default for most Ollama models. ~75% size reduction. Slight quality degradation but excellent for daily use. This is what you should start with.
Q3 and below: Noticeable quality degradation. Only use when you absolutely must fit a larger model into limited RAM.

In Ollama, quantization is usually indicated by the model tag. For example, llama3.1:8b defaults to a Q4 variant, while a tag ending in q8_0 (such as llama3.1:8b-instruct-q8_0) uses Q8. Run ollama show followed by the model name to see the specific quantization.
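The same check can be scripted. The sketch below assumes a local Ollama server on its default port (11434) and reads the quantization_level field from the details object returned by the /api/show endpoint. The helper names are hypothetical, and the field names should be confirmed against the API docs for your Ollama version.

```python
import json
import urllib.request

OLLAMA_SHOW_URL = "http://localhost:11434/api/show"  # assumes the default local port


def extract_quantization(show_response: dict) -> str:
    """Pull the quantization level out of a parsed /api/show response."""
    return show_response.get("details", {}).get("quantization_level", "unknown")


def fetch_quantization(model: str) -> str:
    """Ask a running Ollama server which quantization a local model uses."""
    payload = json.dumps({"model": model}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_SHOW_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_quantization(json.load(response))


# Example (needs a running Ollama server with the model already pulled):
# print(fetch_quantization("llama3.1:8b"))
```

Keeping the parsing separate from the network call makes it easy to try the logic on a saved response without a server running.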

The counterintuitive truth: A well-quantized larger model often outperforms a smaller model at full precision. A 32B model at Q4 typically beats a 14B model at Q8. When choosing between a bigger model with more compression or a smaller model with less compression, go bigger.
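The "go bigger" advice can be turned into a simple selection rule: among the options that fit your RAM budget, prefer the one with the most parameters. This is a sketch under stated assumptions. The candidate list is made up, the bits-per-weight figures (about 4.5 for Q4, about 8.5 for Q8) are approximations, and only weight size is counted, so leave extra headroom in the budget.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    params_billions: float
    bits_per_weight: float  # approximate effective bits, an assumption


def weights_gb(candidate: Candidate) -> float:
    """Approximate size of the weights in GB (1B params at 8 bits is about 1 GB)."""
    return candidate.params_billions * candidate.bits_per_weight / 8


def pick_model(candidates, ram_budget_gb: float) -> Optional[Candidate]:
    """Return the fitting candidate with the most parameters, or None if nothing fits."""
    fitting = [c for c in candidates if weights_gb(c) <= ram_budget_gb]
    return max(fitting, key=lambda c: c.params_billions, default=None)


candidates = [
    Candidate("8B at Q8", 8, 8.5),
    Candidate("14B at Q8", 14, 8.5),
    Candidate("32B at Q4", 32, 4.5),
]

for budget in (12, 16, 24):
    choice = pick_model(candidates, budget)
    print(f"{budget} GB budget -> {choice.name if choice else 'nothing fits'}")
```

Parameter count is only a proxy for quality, so it is still worth comparing the top two candidates on your own prompts before settling.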
