After this lesson you'll know:
- How model size (parameters) relates to quality and resource usage
- What quantization is and which level to choose
- Which models excel at specific tasks (coding, reasoning, writing, chat)
- How to benchmark and compare models on your own hardware
The Parameter-Quality Spectrum
Model parameters are the learned weights from training. More parameters generally means more capability -- but also more RAM, more disk space, and slower inference. The art of local AI is finding the smallest model that handles your task well.
Size tiers in practice:
- 1-3B parameters: Fast, lightweight. Good for simple classification, extraction, and basic Q&A. Think of these as smart autocomplete.
- 7-8B parameters: The local AI workhorse. Handles most writing, coding, and analysis tasks. Quality comparable to GPT-3.5 for straightforward work.
- 14-32B parameters: The sweet spot for serious local work. Strong reasoning, nuanced writing, complex code generation. This is where local starts competing with cloud.
- 70B+ parameters: Near-frontier quality. Requires significant hardware (64GB+ RAM or high-end GPU). Worth it for work that demands deep reasoning or long-context analysis.
Understanding Quantization
Models are typically distributed at 16 bits per parameter (FP16 or BF16). Quantization stores the weights with fewer bits, which cuts file size and RAM usage substantially, usually with only a small quality loss at 8-bit through 4-bit. This is what makes large models runnable on consumer hardware.
Quantization levels:
- Q8 (8-bit): Minimal quality loss. ~50% size reduction from full precision. Use when quality is paramount and you have the RAM.
- Q5: Barely perceptible quality loss. Good balance for most users.
- Q4 (4-bit): The default for most Ollama models. ~75% size reduction. Slight quality degradation but excellent for daily use. This is what you should start with.
- Q3 and below: Noticeable quality degradation. Only use when you absolutely must fit a larger model into limited RAM.
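In Ollama, quantization is usually indicated in the model tag. For example, llama3.1:8b typically resolves to a 4-bit build, while a tag ending in q8_0 (such as llama3.1:8b-instruct-q8_0) is an 8-bit build. Exact tag names vary by model, so check the model's page in the Ollama library. Run ollama show <model-name> to see the quantization a model actually uses.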
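The size reductions quoted for each quantization level follow from simple arithmetic: parameters times bits per parameter, divided by 8 bits per byte. The sketch below is a rough estimate under that assumption only. The function names are hypothetical, and real model files come out somewhat larger because many quantization formats use slightly more bits per weight than their label suggests, and inference needs extra RAM for context.

```python
def estimated_size_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight size in GB: parameters x bits, divided by 8 bits per byte."""
    return params_billions * bits_per_param / 8


def reduction_vs_fp16(bits_per_param: float) -> float:
    """Fractional size reduction relative to 16-bit weights."""
    return 1 - bits_per_param / 16


# Nominal bit widths only; actual formats may use slightly more bits per weight.
for label, bits in [("FP16", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4)]:
    size = estimated_size_gb(8, bits)
    saved = reduction_vs_fp16(bits)
    print(f"8B model at {label}: ~{size:.1f} GB ({saved:.0%} smaller than FP16)")
```

By the same estimate, a 70B model at 4 bits needs about 35 GB for weights alone, which is consistent with the 64GB+ RAM guidance for that tier.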
Model Recommendations by Task
Not all models are created equal. Each model family has strengths:
General writing and chat:
- Llama 3.1 (8B, 70B) -- Meta's flagship. Strong all-rounder. Excellent instruction following.
- Qwen 2.5 (7B, 14B, 32B, 72B) -- Alibaba's model. Exceptional multilingual support and reasoning.
Code generation:
- Qwen 2.5 Coder (7B, 14B, 32B) -- Purpose-built for coding. Excels at Python, JavaScript, TypeScript.
- DeepSeek Coder V2 (16B) -- Strong at complex code reasoning and debugging.
Reasoning and analysis:
- DeepSeek-R1 (8B, 32B, 70B) -- Chain-of-thought reasoning model. Shows its work. Excellent for math, logic, and complex analysis.
- QwQ (32B) -- The Qwen team's reasoning-focused model, with strong analytical capabilities.
Embeddings (for RAG/search):
- nomic-embed-text -- 137M parameters, fast, high-quality embeddings for document search.
- mxbai-embed-large -- 335M parameters, more accurate for nuanced similarity tasks.
Multi-Model Setup Example
A practical local AI lab might run three chat models plus an embedding model:
ollama pull gemma2:2b # Fast model for simple tasks
ollama pull qwen2.5:14b # Daily driver for writing/analysis
ollama pull qwen2.5-coder:14b # Coding specialist
ollama pull nomic-embed-text # Embeddings for document search
Total disk space: roughly 20GB at the default quantization, most of it from the two 14B models. Switch between them based on the task at hand.
Benchmarking on Your Hardware
Published benchmarks don't tell you how a model performs on your specific machine. Run your own tests:
Speed test:
# Time a generation (--verbose prints timing stats after the response)
ollama run llama3.1:8b --verbose "Write a 200-word essay about climate change."
With --verbose, Ollama prints timing statistics after the response; the eval rate line is the generation speed in tokens per second. Aim for 10+ tokens/sec for comfortable interactive use. Below 5 tokens/sec feels sluggish.
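If you collect timings programmatically instead of reading them off the terminal, the same speed figure can be computed from generation statistics. The sketch below assumes the eval_count (generated tokens) and eval_duration (nanoseconds) fields that Ollama's local REST API reports in its generate response. The function names, the sample numbers, and the "usable" label for the 5-10 tokens/sec range are illustrative assumptions.

```python
def tokens_per_second(stats: dict) -> float:
    """Generation speed from response stats.

    Assumes eval_count is the number of generated tokens and
    eval_duration is the generation time in nanoseconds.
    """
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


def describe_speed(tps: float) -> str:
    """Apply the lesson's thresholds: 10+ is comfortable, below 5 is sluggish."""
    if tps >= 10:
        return "comfortable"
    if tps < 5:
        return "sluggish"
    return "usable"  # label for the in-between range is an assumption


# Hypothetical stats for illustration, not a real measurement.
sample = {"eval_count": 300, "eval_duration": 20_000_000_000}
tps = tokens_per_second(sample)
print(f"{tps:.1f} tokens/sec ({describe_speed(tps)})")
```

Running the same prompt a few times and averaging the result gives a steadier number, since the first run also includes model load time.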
Quality test: Run the same 5 prompts through different models and compare outputs. Use prompts that match your actual use case:
- A writing task (draft an email or report section)
- A reasoning task (analyze a problem with multiple variables)
- A coding task (write a function with specific requirements)
- A summarization task (condense a long document)
- An instruction-following task (follow a multi-step prompt precisely)
Rate each output 1-5. The model with the best average across your tasks at an acceptable speed is your daily driver.
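The selection rule above can be sketched in a few lines: filter out models that are too slow, then take the highest average rating. The model names, ratings, and speeds here are made up for illustration, and pick_daily_driver is a hypothetical helper, not part of any tool.

```python
from statistics import mean


def pick_daily_driver(results: dict, min_tps: float = 10.0):
    """Return the model with the best mean rating among those fast enough.

    Returns None if no model meets the speed threshold.
    """
    candidates = {
        name: mean(r["ratings"])
        for name, r in results.items()
        if r["tokens_per_sec"] >= min_tps
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical 1-5 ratings for the five test prompts, plus measured speed.
results = {
    "model-a": {"ratings": [4, 3, 4, 4, 3], "tokens_per_sec": 25.0},
    "model-b": {"ratings": [5, 4, 4, 5, 4], "tokens_per_sec": 12.0},
    "model-c": {"ratings": [5, 5, 5, 4, 5], "tokens_per_sec": 3.0},
}

print(pick_daily_driver(results))
```

Here model-c has the best ratings but falls below the speed threshold, so the faster model-b is chosen. Lowering min_tps changes the answer, which makes the speed/quality trade-off explicit.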
Model Management
Models take disk space. Manage them actively:
# Check disk usage per model
ollama list
# Remove models you don't use
ollama rm model-name
# Keep your daily driver + one specialist + one embedding model
# Delete everything else until you need it
Models can always be re-downloaded. Don't hoard them. Keep your disk clean and pull what you need when you need it. A lean setup with 3-4 models is better than a cluttered one with 20 that you never touch.
Quiz
1. What quantization level do most default Ollama models use?
2. When choosing between a larger model with more quantization or a smaller model with less quantization, which typically performs better?