After this lesson you'll know
- The mathematical intuition behind Low-Rank Adaptation (LoRA)
- How to configure LoRA rank, alpha, and target modules
- Hands-on LoRA fine-tuning with Hugging Face PEFT
- How to merge, swap, and stack multiple LoRA adapters
The Core Idea
The Core Idea
01ConceptUnderstand the core idea
→
02ApplySee it in practice
→
03BuildUse it in your projects
Master the core idea step by step.
Think of LoRA as adding a thin correction layer on top of the frozen model. The base model handles general intelligence. The LoRA adapter handles your specific task. Separate concerns, combined at inference.
Key Hyperparameters
Three parameters control LoRA behavior: **Rank (r):** The dimensionality of the low-rank matrices. Higher rank = more capacity = more VRAM. ``` r = 4: Minimal adaptation. Good for simple style changes. r = 8: Default. Handles most fine-tuning tasks well. r = 16: Strong adaptation. Good for complex domain shifts. r = 32: Near full fine-tuning quality. Higher VRAM cost. r = 64+: Rarely needed. Diminishing returns above 32. Rule of thumb: Start with r=8. Increase if validation loss plateaus. Decrease if you see overfitting. ``` **Alpha (lora_alpha):** A scaling factor applied to the LoRA update. The effective learning rate for LoRA weights is scaled by alpha/rank. ``` Common settings: alpha = rank (scaling factor = 1.0, neutral) alpha = 2 * rank (scaling factor = 2.0, stronger adaptation) alpha = 16 with rank = 8 (most common default pairing) The ratio alpha/rank matters more than the absolute values. Higher ratio = stronger LoRA influence on the output. ``` **Target modules:** Which layers in the model receive LoRA adapters. ```python # Common target module configurations: # Minimal (attention only) -- least VRAM, often sufficient target_modules = ["q_proj", "v_proj"] # Standard (all attention projections) target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"] # Comprehensive (attention + MLP) target_modules = [ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj" ] # Rule: Start with q_proj + v_proj. Add more if needed. # Each additional module increases VRAM by ~15-20%. ```
The rank-target tradeoff: Low rank on many modules often outperforms high rank on few modules. Try r=8 on all attention+MLP layers before trying r=32 on attention-only.
Going deeper on alpha and rank? Read our complete guide: LoRA Scaling Factor: Alpha, Rank & Tuning Guide — covers the alpha/rank ratio math, common misconfigurations, VRAM-aware tuning, and production patterns we use at Like One.
Hands-On: LoRA Fine-Tuning with PEFT
Here is a complete LoRA fine-tuning script using Hugging Face's PEFT library: ```python from transformers import ( AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, ) from peft import LoraConfig, get_peft_model, TaskType from datasets import load_dataset # 1. Load base model and tokenizer model_name = "meta-llama/Llama-3.1-8B-Instruct" tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer.pad_token = tokenizer.eos_token model = AutoModelForCausalLM.from_pretrained( model_name, torch_dtype="auto", device_map="auto", ) # 2. Configure LoRA lora_config = LoraConfig( task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], bias="none", ) model = get_peft_model(model, lora_config) model.print_trainable_parameters() # Output: trainable params: 13,631,488 || all: 8,043,631,616 # || trainable%: 0.1695 # 3. Load and format dataset dataset = load_dataset("json", data_files="training_data.jsonl") def format_example(example): messages = example["messages"] text = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=False ) return tokenizer(text, truncation=True, max_length=2048) tokenized = dataset.map(format_example, remove_columns=["messages"]) # 4. Training arguments training_args = TrainingArguments( output_dir="./lora-output", num_train_epochs=3, per_device_train_batch_size=4, gradient_accumulation_steps=4, learning_rate=2e-4, warmup_steps=100, logging_steps=10, save_strategy="epoch", evaluation_strategy="epoch", fp16=True, report_to="none", ) # 5. Train trainer = Trainer( model=model, args=training_args, train_dataset=tokenized["train"], eval_dataset=tokenized["validation"], ) trainer.train() # 6. Save adapter (NOT the full model) model.save_pretrained("./my-lora-adapter") # This saves only ~50MB instead of 16GB ``` VRAM requirements for this configuration: ``` Llama 3.1 8B with LoRA r=16 (4 attention modules): Model weights (fp16): ~16GB LoRA parameters: ~26MB Optimizer states: ~100MB Activations (batch=4): ~4GB Total: ~20GB VRAM Compare full fine-tuning: Model weights: ~16GB Gradients: ~16GB Optimizer states: ~32GB Activations: ~8GB Total: ~72GB VRAM ```Adapter Management
LoRA adapters are modular. You can merge, swap, and stack them: **Merging adapter into base model (for deployment):** ```python from peft import PeftModel base_model = AutoModelForCausalLM.from_pretrained(model_name) model = PeftModel.from_pretrained(base_model, "./my-lora-adapter") # Merge LoRA weights into base model merged_model = model.merge_and_unload() # Save as a standard model (no PEFT dependency at inference) merged_model.save_pretrained("./merged-model") ``` **Swapping adapters at runtime:** ```python from peft import PeftModel base_model = AutoModelForCausalLM.from_pretrained(model_name) model = PeftModel.from_pretrained(base_model, "./adapter-legal") # Swap to a different adapter without reloading base model model.load_adapter("./adapter-medical", adapter_name="medical") model.set_adapter("medical") # Swap back model.set_adapter("default") # back to legal adapter ``` **Stacking adapters (experimental):** Multiple LoRA adapters can be combined with weighted blending. Useful for multi-task models that need capabilities from different fine-tuning runs.Exercise: Fine-Tune with LoRA
Using the dataset from Lesson 2, fine-tune a small model (Llama 3.1 8B or Mistral 7B) with LoRA. Use r=8, alpha=16, target q_proj and v_proj. Train for 3 epochs. Compare the model's output before and after fine-tuning on 5 test examples. Note the adapter file size versus the full model size.Quiz
1What is the key mathematical insight behind LoRA?
2What does 'r=16 on all modules' vs 'r=32 on attention-only' typically produce?
Vocabulary
How much parameter reduction does LoRA achieve with rank 16 on a 4096x4096 weight matrix?
Full update: 16.7M parameters. LoRA r=16: 131K parameters. That is a 99.2% reduction in trainable parameters.
What are the three key LoRA hyperparameters?
1. Rank (r): dimensionality of low-rank matrices (typical: 8-32). 2. Alpha: scaling factor (common: 2x rank). 3. Target modules: which layers get adapters (start with q_proj + v_proj).
How much VRAM does LoRA fine-tuning of Llama 3.1 8B require vs full fine-tuning?
LoRA: ~20GB VRAM. Full fine-tuning: ~72GB VRAM. LoRA requires roughly 1/4 the memory.
What is the advantage of saving a LoRA adapter separately vs merging?
Separate adapters are tiny (10-100MB), can be swapped at runtime, and keep the base model untouched. Merged models are simpler to deploy but lose adapter modularity.