GPU Optimization & Performance Tuning

Squeeze maximum speed from your hardware -- GPU offloading, memory management, and inference optimization for local AI.

After this lesson you'll know

  • How GPU acceleration works for AI inference (CUDA, Metal, ROCm)
  • Memory management strategies for running larger models
  • Ollama configuration options for performance tuning
  • Benchmarking and monitoring your local AI performance

How GPU Acceleration Works

AI model inference is fundamentally matrix multiplication -- thousands of parallel operations on arrays of numbers. CPUs handle these sequentially (a few operations at a time). GPUs handle them in parallel (thousands at once). This is why GPU-accelerated inference is 5-20x faster than CPU-only.
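You can measure the gap on your own machine. A minimal sketch, assuming llama3.1:8b is already pulled -- setting num_gpu to 0 forces CPU-only inference for comparison:

# Run with GPU acceleration (the default); --verbose prints an "eval rate"
# (tokens per second) after each response
ollama run llama3.1:8b --verbose

# Inside the same session, disable GPU offloading and ask again to compare
>>> /set parameter num_gpu 0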

Acceleration frameworks by platform:

  • NVIDIA CUDA: The gold standard. Supported by every AI framework. Requires NVIDIA GPU + CUDA drivers. Works on Linux and Windows.
  • Apple Metal: Built into all Apple Silicon Macs (M1-M4). Ollama uses Metal automatically -- no configuration needed. Unified memory means the GPU can access all system RAM.
  • AMD ROCm: Growing support on Linux. Some Ollama builds support ROCm for AMD GPUs. Less mature than CUDA but improving.

Ollama detects your GPU automatically and uses it. You can verify with ollama ps, which shows how much of the model is loaded into GPU vs. CPU memory.
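A quick way to see what your machine exposes and what Ollama is actually using -- a minimal sketch; nvidia-smi and rocm-smi exist only on NVIDIA and AMD systems respectively, and the exact ollama ps columns vary by version:

# Confirm the GPU and its driver are visible to the OS
nvidia-smi            # NVIDIA (CUDA)
rocm-smi              # AMD (ROCm, Linux)
# Apple Silicon needs no check -- Metal is always available

# Show loaded models and how each is split between GPU and CPU memory
ollama ps

# Watch GPU memory while a model is generating (NVIDIA)
watch -n 1 nvidia-smi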

The VRAM bottleneck: On NVIDIA GPUs, the model must fit in VRAM (GPU memory). An RTX 4090 has 24GB VRAM -- enough for a 32B Q4 model. On Apple Silicon, there is no separate VRAM -- the unified memory pool is shared. A 32GB M3 MacBook can run models that would require a 24GB GPU on other platforms.
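A rough back-of-the-envelope estimate -- not an exact formula, since quantization format, context length, and runtime overhead all shift the numbers:

# Q4 weights take roughly 0.5 bytes per parameter:
#   32B parameters x ~0.5 bytes ≈ 16-18 GB of weights
#   + a few GB for KV cache and buffers (grows with context length)
#   ≈ 20+ GB total -- tight but workable on a 24 GB RTX 4090

# Check how much VRAM is actually free before loading (NVIDIA)
nvidia-smi --query-gpu=memory.total,memory.free --format=csv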

Memory Management Strategies

When a model is too large for your GPU, you have three options:

1. Quantize more aggressively. Drop from Q4 to Q3 or Q2. You lose some quality but gain significant memory savings. For a 70B model, Q3 vs Q4 can save 10GB+ of RAM.

2. Partial GPU offloading. Load some model layers on GPU and the rest on CPU. The GPU-loaded layers run fast, CPU layers run slower. Overall speed is a blend. In Ollama, this happens automatically when the model doesn't fully fit in GPU memory.

3. Use a smaller model. Often the best answer. A 14B model running entirely on GPU will outpace a 70B model split across GPU and CPU in tokens-per-second, even if the 70B model produces slightly better quality per token. The sketch below shows how to compare these options on your own hardware.
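A sketch of how to compare the options in practice. The quantization tag below (q3_K_M) is illustrative -- available tags differ per model, so check the model's page on the Ollama library first:

# Pull a more aggressively quantized build of the same model
ollama pull llama3.1:70b-instruct-q3_K_M

# --verbose prints timing stats after each response, including eval rate (tokens/s)
ollama run llama3.1:70b-instruct-q3_K_M --verbose
ollama run qwen2.5:14b --verbose

# Compare the reported eval rates to see which setup is actually faster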

Ollama Memory Configuration

# Offload a fixed number of model layers to the GPU
# (num_gpu is a model parameter set from the interactive prompt, a Modelfile,
#  or the API -- it is not an environment variable)
ollama run llama3.1:70b
>>> /set parameter num_gpu 35

# Limit the number of CPU threads used for inference (also a model parameter)
>>> /set parameter num_thread 8

# Keep models loaded in memory (prevents reload delay)
# Default: a model stays loaded for 5 minutes after last use
OLLAMA_KEEP_ALIVE=30m ollama serve

# Limit how many models can stay loaded at the same time
OLLAMA_MAX_LOADED_MODELS=2 ollama serve
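The same settings can be applied per request through the HTTP API. A minimal sketch, assuming the default server at localhost:11434 and that qwen2.5:14b is already pulled (the prompt text is just a placeholder):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:14b",
  "prompt": "Explain GPU offloading in one sentence.",
  "keep_alive": "30m",
  "options": {
    "num_gpu": 35,
    "num_thread": 8
  }
}'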