Agent skill
transformers
Loading and using pretrained models with Hugging Face Transformers. Use when working with pretrained models from the Hub, running inference with Pipeline API, fine-tuning models with Trainer, or handling text, vision, audio, and multimodal tasks.
Install this agent skill to your Project
npx add-skill https://github.com/itsmostafa/llm-engineering-skills/tree/main/skills/transformers
SKILL.md
Using Hugging Face Transformers
Transformers is the model-definition framework for state-of-the-art machine learning across text, vision, audio, and multimodal domains. It provides unified APIs for loading pretrained models, running inference, and fine-tuning.
Table of Contents
- Core Concepts
- Pipeline API
- Model Loading
- Inference Patterns
- Fine-tuning with Trainer
- Working with Modalities
- Memory and Performance
- Best Practices
Core Concepts
The Three Core Classes
Every model in Transformers has three core components:
from transformers import AutoConfig, AutoModel, AutoTokenizer
# Configuration: hyperparameters and architecture settings
config = AutoConfig.from_pretrained("bert-base-uncased")
# Model: the neural network weights
model = AutoModel.from_pretrained("bert-base-uncased")
# Tokenizer/Processor: converts inputs to tensors
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
The from_pretrained Pattern
All loading uses from_pretrained() which handles downloading, caching, and device placement:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto", # Automatic device placement
)
Auto Classes
Use task-specific Auto classes for the correct model head:
from transformers import (
AutoModelForCausalLM, # Text generation (GPT, Llama)
AutoModelForSeq2SeqLM, # Encoder-decoder (T5, BART)
AutoModelForSequenceClassification, # Classification
AutoModelForTokenClassification, # NER, POS tagging
AutoModelForQuestionAnswering, # Extractive QA
AutoModelForMaskedLM, # BERT-style masked LM
AutoModelForImageClassification, # Vision models
AutoModelForSpeechSeq2Seq, # Speech recognition
)
Pipeline API
The pipeline() function provides high-level inference with minimal code:
Text Tasks
from transformers import pipeline
# Text generation
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B")
output = generator("The secret to success is", max_new_tokens=50)
# Text classification
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Hugging Face is based in New York City.")
# Question answering
qa = pipeline("question-answering")
answer = qa(question="What is the capital?", context="Paris is the capital of France.")
# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(long_text, max_length=130, min_length=30)
# Translation
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you?")
Chat/Conversational
from transformers import pipeline
import torch
pipe = pipeline(
"text-generation",
model="meta-llama/Meta-Llama-3-8B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."},
]
response = pipe(messages, max_new_tokens=256)
print(response[0]["generated_text"][-1]["content"])
Vision Tasks
# Image classification
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/image.jpg")
# Object detection
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
objects = detector("path/to/image.jpg")
# Image segmentation
segmenter = pipeline("image-segmentation", model="facebook/mask2former-swin-base-coco-panoptic")
masks = segmenter("path/to/image.jpg")
Audio Tasks
# Speech recognition
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
text = transcriber("path/to/audio.mp3")
# Audio classification
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
result = classifier("path/to/audio.wav")
Multimodal Tasks
# Visual question answering
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
answer = vqa(image="image.jpg", question="What color is the car?")
# Image-to-text (captioning)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("image.jpg")
# Document question answering
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
answer = doc_qa(image="document.png", question="What is the total?")
Model Loading
Device Placement
from transformers import AutoModelForCausalLM
import torch
# Automatic placement across available devices
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-3B",
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Specific device
model = AutoModelForCausalLM.from_pretrained(
"gpt2",
device_map="cuda:0",
)
# Custom device map for model parallelism
device_map = {
"model.embed_tokens": 0,
"model.layers.0": 0,
"model.layers.1": 1,
"model.norm": 1,
"lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)
Loading from Local Path
# Save model locally
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")
# Load from local path
model = AutoModelForCausalLM.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")
Trust Remote Code
Some models require executing custom code from the Hub:
model = AutoModelForCausalLM.from_pretrained(
"microsoft/phi-2",
trust_remote_code=True, # Required for custom architectures
)
Inference Patterns
Text Generation
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# Basic generation
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# With generation config
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
top_k=50,
repetition_penalty=1.1,
)
Chat Templates
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
]
# Apply chat template
input_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
Getting Embeddings
from transformers import AutoModel, AutoTokenizer
import torch
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
def get_embeddings(texts: list[str]) -> torch.Tensor:
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
# Mean pooling
attention_mask = inputs["attention_mask"]
embeddings = outputs.last_hidden_state
mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
sum_embeddings = (embeddings * mask_expanded).sum(1)
sum_mask = mask_expanded.sum(1).clamp(min=1e-9)
return sum_embeddings / sum_mask
embeddings = get_embeddings(["Hello world", "How are you?"])
Classification
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
inputs = tokenizer("I love this movie!", return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
predictions = torch.softmax(outputs.logits, dim=-1)
labels = model.config.id2label
for idx, prob in enumerate(predictions[0]):
print(f"{labels[idx]}: {prob:.4f}")
Fine-tuning with Trainer
Basic Fine-tuning
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments,
)
from datasets import load_dataset
# Load data and model
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased",
num_labels=2,
)
# Tokenize dataset
def tokenize(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized = dataset.map(tokenize, batched=True)
# Training arguments
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
logging_steps=100,
save_strategy="epoch",
load_best_model_at_end=True,
)
# Train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
)
trainer.train()
Pushing to Hub
# Login first: huggingface-cli login
# Push model and tokenizer
model.push_to_hub("my-username/my-fine-tuned-model")
tokenizer.push_to_hub("my-username/my-fine-tuned-model")
# Or use trainer
trainer.push_to_hub()
See reference/fine-tuning.md for advanced patterns including LoRA, custom data collators, and evaluation metrics.
Working with Modalities
Vision Models
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
Audio Models
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
"openai/whisper-large-v3",
torch_dtype=torch.float16,
device_map="auto",
)
# Load audio (use librosa, soundfile, or datasets)
import librosa
audio, sr = librosa.load("audio.mp3", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
Vision-Language Models
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch
model_name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
image = Image.open("image.jpg")
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(outputs[0], skip_special_tokens=True)
Memory and Performance
Quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-3B",
quantization_config=bnb_config,
device_map="auto",
)
# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-3B",
load_in_8bit=True,
device_map="auto",
)
Flash Attention
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-3B",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2", # Requires flash-attn package
device_map="auto",
)
torch.compile
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = torch.compile(model, mode="reduce-overhead")
Batched Inference
texts = ["First prompt", "Second prompt", "Third prompt"]
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=50)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
See reference/advanced-inference.md for streaming, KV caching, and serving patterns.
Best Practices
- Use bfloat16 over float16: Better numerical stability on modern GPUs
- Set pad token for generation:
tokenizer.pad_token = tokenizer.eos_token - Use device_map="auto": Let Accelerate handle device placement
- Enable Flash Attention: Significant speedup for long sequences
- Batch when possible: Amortize fixed costs across multiple inputs
- Use pipeline for quick prototyping: Switch to manual control for production
- Cache models locally: Set
HF_HOMEenvironment variable for model cache location - Check model license: Verify usage rights before deployment
References
See reference/ for detailed documentation:
fine-tuning.md- Advanced fine-tuning patterns with LoRA, PEFT, and custom trainingadvanced-inference.md- Generation strategies, streaming, and serving
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
prompt-engineering
Crafting effective prompts for LLMs. Use when designing prompts, improving output quality, structuring complex instructions, or debugging poor model responses.
qlora
Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds memory. Builds on the lora skill.
agents
Patterns and architectures for building AI agents and workflows with LLMs. Use when designing systems that involve tool use, multi-step reasoning, autonomous decision-making, or orchestration of LLM-driven tasks.
rlhf
Understanding Reinforcement Learning from Human Feedback (RLHF) for aligning language models. Use when learning about preference data, reward modeling, policy optimization, or direct alignment algorithms like DPO.
pytorch
Building and training neural networks with PyTorch. Use when implementing deep learning models, training loops, data pipelines, model optimization with torch.compile, distributed training, or deploying PyTorch models.
lora
Parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). Use when fine-tuning large language models with limited GPU memory, creating task-specific adapters, or when you need to train multiple specialized models from a single base.
Didn't find tool you were looking for?