Agent skill
lora
Parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). Use when fine-tuning large language models with limited GPU memory, creating task-specific adapters, or when you need to train multiple specialized models from a single base.
Install this agent skill to your Project
npx add-skill https://github.com/itsmostafa/llm-engineering-skills/tree/main/skills/lora
SKILL.md
Using LoRA for Fine-tuning
LoRA (Low-Rank Adaptation) enables efficient fine-tuning by freezing the pretrained weights and injecting small trainable low-rank matrices into transformer layers. This typically cuts trainable parameters to roughly 0.1-1% of the original model while keeping task performance close to full fine-tuning in many settings.
Table of Contents
- Core Concepts
- Basic Setup
- Configuration Parameters
- QLoRA (Quantized LoRA)
- Training Patterns
- Saving and Loading
- Merging Adapters
- Best Practices
Core Concepts
How LoRA Works
Instead of updating all weights during fine-tuning, LoRA decomposes weight updates into low-rank matrices:
W' = W + BA
Where:
- W is the frozen pretrained weight matrix (d × k)
- B is a trainable matrix (d × r)
- A is a trainable matrix (r × k)
- r is the rank, much smaller than d and k
The key insight: weight updates during fine-tuning have low intrinsic rank, so we can represent them efficiently with smaller matrices.
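To make the saving concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning (d × k) and under LoRA (r × (d + k)). The helper name `lora_param_counts` and the example shapes are illustrative assumptions, not part of any library.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one d x k weight."""
    full = d * k        # every entry of W is trainable
    lora = r * (d + k)  # B is d x r, A is r x k
    return full, lora


full, lora = lora_param_counts(d=4096, k=4096, r=16)
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```

The ratio for a single matrix depends only on d, k, and r; the whole-model percentage is lower still because embeddings and untargeted layers stay frozen.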
Why Use LoRA
| Aspect | Full Fine-tuning | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1-1% |
| Memory usage | High | Low |
| Adapter size | Full model | ~3-100 MB |
| Training speed | Slower | Faster |
| Multiple tasks | Separate models | Swap adapters |
Basic Setup
Installation
pip install peft transformers accelerate
Minimal Example
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch
# Load base model
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 3,407,872 || all params: 1,238,300,672 || trainable%: 0.28%
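The trainable parameter count printed above can be reproduced by hand. The sketch below assumes the Llama-3.2-1B shapes (hidden size 2048, key/value projection width 512 from grouped-query attention, 16 layers); these dimensions are not stated in this document, so treat them as assumptions to check against the model's config.

```python
# Assumed Llama-3.2-1B shapes (verify against the model's config.json)
HIDDEN = 2048     # hidden size
KV_DIM = 512      # key/value projection width (grouped-query attention)
NUM_LAYERS = 16
R = 16            # LoRA rank from the example above

# (in_features, out_features) of each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

# Each adapted layer adds r * (in + out) parameters (matrices A and B)
per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,}")  # 3,407,872
```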
Configuration Parameters
LoraConfig Options
from peft import LoraConfig, TaskType
config = LoraConfig(
    # Core parameters
    r=16,  # Rank of update matrices
    lora_alpha=32,  # Scaling factor (alpha/r applied to updates)
    target_modules=["q_proj", "v_proj"],  # Layers to adapt
    # Regularization
    lora_dropout=0.05,  # Dropout on LoRA layers
    bias="none",  # "none", "all", or "lora_only"
    # Task configuration
    task_type=TaskType.CAUSAL_LM,  # CAUSAL_LM, SEQ_CLS, SEQ_2_SEQ_LM, TOKEN_CLS
    # Advanced
    modules_to_save=None,  # Additional modules to train (e.g., ["lm_head"])
    layers_to_transform=None,  # Specific layer indices to adapt
    use_rslora=False,  # Rank-stabilized LoRA scaling
    use_dora=False,  # Weight-Decomposed LoRA
)
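The interaction between `r`, `lora_alpha`, and `use_rslora` is easier to see numerically. The sketch below computes the multiplier applied to the BA update: alpha/r for standard LoRA and alpha/sqrt(r) for rank-stabilized LoRA. The function name `lora_scale` is a hypothetical helper for illustration, not a PEFT API.

```python
import math


def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update BA."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized scaling
    return lora_alpha / r                 # standard LoRA scaling


for r in (8, 16, 64):
    print(r, lora_scale(32, r), lora_scale(32, r, use_rslora=True))
```

With a fixed `lora_alpha`, the standard multiplier shrinks quickly as the rank grows, which is one reason `lora_alpha` is often raised together with `r`.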
Target Modules by Architecture
# Llama, Mistral, Qwen
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
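# GPT-2 (uses Conv1D layers; PEFT expects fan_in_fan_out=True for these)
target_modules = ["c_attn", "c_proj", "c_fc"]
# GPT-J
target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "fc_in", "fc_out"]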
# BERT, RoBERTa
target_modules = ["query", "key", "value", "dense"]
# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
# Phi-1.5, Phi-2 (Phi-3 uses fused names such as qkv_proj and gate_up_proj)
target_modules = ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
Finding Target Modules
# Print all linear layer names
import torch
def find_target_modules(model):
    linear_modules = set()
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Get the last part of the name (e.g., "q_proj" from "model.layers.0.self_attn.q_proj")
            layer_name = name.split(".")[-1]
            linear_modules.add(layer_name)
    # Note: the result can include the output head (e.g., "lm_head")
    return sorted(linear_modules)
print(find_target_modules(model))
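A scan over Linear layers also picks up the output head, which is often left out of `target_modules`. The sketch below isolates the name-handling step on plain strings so it can be tried without loading a model. The helper `leaf_module_names`, its `exclude` default, and the sample names are hypothetical.

```python
def leaf_module_names(qualified_names, exclude=("lm_head",)):
    """Collect the last dotted component of each name, skipping excluded leaves."""
    leaves = {name.split(".")[-1] for name in qualified_names}
    return sorted(leaves - set(exclude))


# Hypothetical module names in the style of a Llama checkpoint
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]
print(leaf_module_names(names))  # ['down_proj', 'q_proj', 'v_proj']
```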
QLoRA (Quantized LoRA)
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.
Setup
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
Memory Requirements
| Model Size | Full FT (16-bit) | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |
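The figures in the table are rough totals. As a sanity check on the weight-only portion, the sketch below estimates the memory needed just to hold the parameters at a given precision. It deliberately ignores gradients, optimizer states, activations, and quantization overhead, so real usage is higher; the helper name `weight_memory_gb` is an assumption for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for billions in (7, 13, 70):
    n = billions * 1e9
    print(
        f"{billions}B: 16-bit {weight_memory_gb(n, 16):.1f} GB, "
        f"4-bit {weight_memory_gb(n, 4):.1f} GB"
    )
```

For a 7B model this gives about 14 GB at 16-bit and 3.5 GB at 4-bit; the gap between these numbers and the table reflects everything else that training keeps in memory.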
Training Patterns
With Hugging Face Trainer
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
# Prepare dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
def format_prompt(example):
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}
dataset = dataset.map(format_prompt)
def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding=False,
    )
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
# Training arguments (note higher learning rate)
training_args = TrainingArguments(
    output_dir="./lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,  # Higher than full fine-tuning
    bf16=True,
    logging_steps=10,
    save_steps=500,
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
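The batch settings above combine into an effective batch size, which in turn determines the number of optimizer steps and warmup steps. The sketch below works through that arithmetic for a hypothetical dataset of 52,000 examples on one device. The helper `schedule` is illustrative only, and Trainer's own rounding may differ by a step.

```python
import math


def schedule(num_examples, per_device_batch, grad_accum, num_devices=1,
             epochs=1, warmup_ratio=0.03):
    """Return (effective batch size, optimizer steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps = math.ceil(num_examples / effective_batch) * epochs
    warmup = math.ceil(steps * warmup_ratio)
    return effective_batch, steps, warmup


# 52,000 examples is an assumed dataset size for illustration
print(schedule(num_examples=52_000, per_device_batch=4, grad_accum=4))
# (16, 3250, 98)
```

With `save_steps=500`, a run of this length would write a handful of checkpoints over the epoch.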
With SFTTrainer (TRL)
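from trl import SFTTrainer, SFTConfig
# Note: argument names differ across TRL releases. Some newer releases expect
# dataset_text_field in SFTConfig, rename max_seq_length to max_length, and
# take processing_class= instead of tokenizer=. Check your installed version.
sft_config = SFTConfig(
    output_dir="./sft-lora",
    max_seq_length=1024,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    gradient_checkpointing=True,
)
trainer = SFTTrainer(
    model=model,  # Base model that has not already been wrapped by get_peft_model
    args=sft_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,  # Pass config directly, SFTTrainer applies it
    dataset_text_field="text",
)
trainer.train()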
Classification Task
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"],  # Train classification head fully
)
model = get_peft_model(model, lora_config)
Saving and Loading
Save Adapter
# Save only LoRA weights (small file)
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")
# Push to Hub
model.push_to_hub("username/my-lora-adapter")
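Because only the LoRA matrices are saved, the adapter file size follows almost directly from the trainable parameter count. The sketch below estimates it for the 3,407,872-parameter adapter from the minimal example, assuming the weights are stored in float32 or bfloat16 and ignoring file metadata. The helper name `adapter_size_mb` is hypothetical.

```python
def adapter_size_mb(trainable_params: int, bytes_per_param: int = 4) -> float:
    """Approximate adapter size in MB (1 MB = 1e6 bytes), ignoring metadata."""
    return trainable_params * bytes_per_param / 1e6


print(f"{adapter_size_mb(3_407_872):.1f} MB")                     # 13.6 MB in float32
print(f"{adapter_size_mb(3_407_872, bytes_per_param=2):.1f} MB")  # 6.8 MB in bfloat16
```

Using `modules_to_save` adds full copies of those modules to the adapter, so the saved size can be considerably larger in that case.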
Load Adapter
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
# For inference
model.eval()
Switch Between Adapters
# Load multiple adapters
model.load_adapter("./adapter-1", adapter_name="task1")
model.load_adapter("./adapter-2", adapter_name="task2")
# Example inputs (placeholder prompt)
inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to(model.device)
# Switch active adapter
model.set_adapter("task1")
output = model.generate(**inputs)
model.set_adapter("task2")
output = model.generate(**inputs)
# Disable adapter (use base model)
with model.disable_adapter():
    output = model.generate(**inputs)
Merging Adapters
Merge LoRA weights into the base model for deployment without adapter overhead.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # Merge on CPU to avoid memory issues
)
# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
# Merge and unload
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
# Push merged model to Hub
merged_model.push_to_hub("username/my-merged-model")
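Merging works because the adapter path and the merged weight compute the same function: W x + s · B (A x) equals (W + s · B A) x, where s is the alpha/r scaling. The sketch below checks this on toy matrices in plain Python. The shapes, values, and helper names are illustrative assumptions and do not reflect how PEFT implements merging internally.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y, scale=1.0):
    """Return X + scale * Y elementwise."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Toy shapes: d = 2, k = 3, r = 1
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]      # frozen weight, d x k
B = [[1.0],
     [2.0]]                # trainable, d x r
A = [[0.5, 0.0, 1.0]]      # trainable, r x k
scale = 2.0                # alpha / r, for example alpha=2 with r=1
x = [[1.0], [2.0], [3.0]]  # input column vector, k x 1

# Adapter path: W x + scale * B (A x)
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path: (W + scale * B A) x
W_merged = add(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(unmerged, merged)  # [[14.0], [16.0]] [[14.0], [16.0]]
```

After merging, inference needs only one matrix multiply per layer, which is why the merged model has no adapter overhead.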
Best Practices
- Start with r=16: Scale up to 32 or 64 if the model underfits, down to 8 if overfitting or memory-constrained
- Set lora_alpha = 2 × r: This is a common heuristic; the effective scaling is alpha/r
- Target all attention and MLP layers: For best results on LLMs, include gate/up/down projections:
  target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- Use higher learning rate: 2e-4 is typical for LoRA vs 2e-5 for full fine-tuning
- Enable gradient checkpointing: Reduces memory at cost of ~20% slower training:
  model.gradient_checkpointing_enable()
- Use QLoRA for large models: Essential for fine-tuning 7B+ models on consumer GPUs
- Keep dropout low: 0.05 is usually sufficient; higher values may hurt performance
- Save checkpoints frequently: LoRA adapters are small, so save often
- Evaluate on base model too: Ensure adapter doesn't degrade base capabilities
- Consider modules_to_save for task heads: For classification, train the classifier fully:
  modules_to_save=["classifier", "score"]
References
See reference/ for detailed documentation:
- advanced-techniques.md: DoRA, rsLoRA, adapter composition, and debugging