Choosing the Right Model
Size vs. quality vs. speed -- how to pick the right model for each task without wasting RAM or waiting forever.
After this lesson you'll know:
- How model size (parameters) relates to quality and resource usage
- What quantization is and which level to choose
- Which models excel at specific tasks (coding, reasoning, writing, chat)
- How to benchmark and compare models on your own hardware
The Parameter-Quality Spectrum
Model parameters are the learned weights from training. More parameters generally means more capability -- but also more RAM, more disk space, and slower inference. The art of local AI is finding the smallest model that handles your task well.
Size tiers in practice:
- 1-3B parameters: Fast, lightweight. Good for simple classification, extraction, and basic Q&A. Think of these as smart autocomplete.
- 7-8B parameters: The local AI workhorse. Handles most writing, coding, and analysis tasks. Quality comparable to GPT-3.5 for straightforward work.
- 14-32B parameters: The sweet spot for serious local work. Strong reasoning, nuanced writing, complex code generation. This is where local starts competing with cloud.
- 70B+ parameters: Near-frontier quality. Requires significant hardware (64GB+ RAM or high-end GPU). Worth it for work that demands deep reasoning or long-context analysis.
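To make the tiers above concrete, here is a rough rule-of-thumb sketch (my own illustration, not from Ollama's documentation): at Q4 quantization, weights take about half a byte per parameter, and I assume roughly 20% extra for the KV cache and runtime buffers. Actual usage varies with context length and backend.

```python
# Rough RAM estimate per size tier, assuming Q4 quantization
# (~0.5 bytes per parameter) plus ~20% overhead for KV cache
# and runtime buffers. Illustrative numbers only.
def estimated_ram_gb(params_billions: float) -> float:
    weight_gb = params_billions * 0.5   # 4-bit weights
    return round(weight_gb * 1.2, 1)    # add ~20% overhead

for tier in (3, 8, 32, 70):
    print(f"{tier}B params -> ~{estimated_ram_gb(tier)} GB RAM")
```

By this estimate a 70B model needs roughly 42 GB for weights and cache alone, which is why 64GB+ of system RAM is the practical floor for that tier.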
Understanding Quantization
Full-precision models use 16 bits per parameter. Quantization compresses these to fewer bits, dramatically reducing size and RAM usage with minimal quality loss. This is what makes large models runnable on consumer hardware.
Quantization levels:
- Q8 (8-bit): Minimal quality loss. ~50% size reduction from full precision. Use when quality is paramount and you have the RAM.
- Q5 (5-bit): Barely perceptible quality loss. Good balance for most users.
- Q4 (4-bit): The default for most Ollama models. ~75% size reduction. Slight quality degradation but excellent for daily use. This is what you should start with.
- Q3 and below: Noticeable quality degradation. Only use when you absolutely must fit a larger model into limited RAM.
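The size reductions above follow from simple arithmetic: weights dominate a model file, so size is roughly parameters × bits ÷ 8 bytes. A quick sketch for an 8B-parameter model (illustrative numbers, ignoring file-format overhead):

```python
# Approximate file size of an 8B-parameter model at each precision.
# Weights dominate, so size ~= parameters * bits_per_parameter / 8 bytes.
PARAMS = 8_000_000_000

for label, bits in [("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4)]:
    size_gb = PARAMS * bits / 8 / 1e9
    reduction = (1 - bits / 16) * 100
    print(f"{label}: {size_gb:.1f} GB ({reduction:.0f}% smaller than FP16)")
```

This recovers the figures in the list: Q8 halves the 16 GB FP16 file to 8 GB, and Q4 cuts it by 75% to 4 GB.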
In Ollama, quantization is indicated by the model tag. For example, llama3.1:8b typically uses Q4, while llama3.1:8b-q8_0 explicitly uses Q8. Run ollama show modelname to see the exact quantization of an installed model.
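As a small illustration of how those tags encode quantization, here is a hypothetical helper (quant_from_tag is my own name, not part of Ollama's API) that pulls the quantization suffix out of a tag string:

```python
def quant_from_tag(tag: str) -> str:
    """Infer quantization from an Ollama-style tag (hypothetical helper).

    Tags without an explicit quantization suffix default to Q4
    in most Ollama model listings.
    """
    _, _, variant = tag.partition(":")       # e.g. "8b-q8_0"
    for part in variant.split("-"):
        if part.startswith("q") and part[1:2].isdigit():
            return part.upper()              # e.g. "Q8_0"
    return "Q4 (default)"

print(quant_from_tag("llama3.1:8b-q8_0"))    # explicit Q8 variant
print(quant_from_tag("llama3.1:8b"))         # no suffix: assume Q4
```

This is only a parsing sketch; ollama show remains the authoritative way to check what a downloaded model actually uses.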