Agent skill

transformers

Loading and using pretrained models with Hugging Face Transformers. Use when working with pretrained models from the Hub, running inference with Pipeline API, fine-tuning models with Trainer, or handling text, vision, audio, and multimodal tasks.

View SKILL.md on GitHub Repository

Stars 17

Forks 0

Install this agent skill to your Project

npx add-skill https://github.com/itsmostafa/llm-engineering-skills/tree/main/skills/transformers

SKILL.md

Using Hugging Face Transformers

Transformers is the model-definition framework for state-of-the-art machine learning across text, vision, audio, and multimodal domains. It provides unified APIs for loading pretrained models, running inference, and fine-tuning.

Core Concepts
Pipeline API
Model Loading
Inference Patterns
Fine-tuning with Trainer
Working with Modalities
Memory and Performance
Best Practices

Core Concepts

The Three Core Classes

Every model in Transformers has three core components:

python

from transformers import AutoConfig, AutoModel, AutoTokenizer

# Configuration: hyperparameters and architecture settings
config = AutoConfig.from_pretrained("bert-base-uncased")

# Model: the neural network weights
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenizer/Processor: converts inputs to tensors
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

The `from_pretrained` Pattern

All loading uses from_pretrained() which handles downloading, caching, and device placement:

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # Automatic device placement
)

Auto Classes

Use task-specific Auto classes for the correct model head:

python

from transformers import (
    AutoModelForCausalLM,          # Text generation (GPT, Llama)
    AutoModelForSeq2SeqLM,         # Encoder-decoder (T5, BART)
    AutoModelForSequenceClassification,  # Classification
    AutoModelForTokenClassification,     # NER, POS tagging
    AutoModelForQuestionAnswering,       # Extractive QA
    AutoModelForMaskedLM,                # BERT-style masked LM
    AutoModelForImageClassification,     # Vision models
    AutoModelForSpeechSeq2Seq,           # Speech recognition
)

Pipeline API

The pipeline() function provides high-level inference with minimal code:

Text Tasks

python

from transformers import pipeline

# Text generation
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B")
output = generator("The secret to success is", max_new_tokens=50)

# Text classification
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Hugging Face is based in New York City.")

# Question answering
qa = pipeline("question-answering")
answer = qa(question="What is the capital?", context="Paris is the capital of France.")

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(long_text, max_length=130, min_length=30)

# Translation
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you?")

Chat/Conversational

python

from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

response = pipe(messages, max_new_tokens=256)
print(response[0]["generated_text"][-1]["content"])

Vision Tasks

python

# Image classification
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/image.jpg")

# Object detection
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
objects = detector("path/to/image.jpg")

# Image segmentation
segmenter = pipeline("image-segmentation", model="facebook/mask2former-swin-base-coco-panoptic")
masks = segmenter("path/to/image.jpg")

Audio Tasks

python

# Speech recognition
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
text = transcriber("path/to/audio.mp3")

# Audio classification
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
result = classifier("path/to/audio.wav")

Multimodal Tasks

python

# Visual question answering
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
answer = vqa(image="image.jpg", question="What color is the car?")

# Image-to-text (captioning)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("image.jpg")

# Document question answering
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
answer = doc_qa(image="document.png", question="What is the total?")

Model Loading

Device Placement

python

from transformers import AutoModelForCausalLM
import torch

# Automatic placement across available devices
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Specific device
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="cuda:0",
)

# Custom device map for model parallelism
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)

Loading from Local Path

python

# Save model locally
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")

# Load from local path
model = AutoModelForCausalLM.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")

Trust Remote Code

Some models require executing custom code from the Hub:

python

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    trust_remote_code=True,  # Required for custom architectures
)

Inference Patterns

Text Generation

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Basic generation
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# With generation config
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)

Chat Templates

python

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Apply chat template
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

Getting Embeddings

python

from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embeddings(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean pooling
    attention_mask = inputs["attention_mask"]
    embeddings = outputs.last_hidden_state
    mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
    sum_embeddings = (embeddings * mask_expanded).sum(1)
    sum_mask = mask_expanded.sum(1).clamp(min=1e-9)
    return sum_embeddings / sum_mask

embeddings = get_embeddings(["Hello world", "How are you?"])

Classification

python

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("I love this movie!", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.softmax(outputs.logits, dim=-1)

labels = model.config.id2label
for idx, prob in enumerate(predictions[0]):
    print(f"{labels[idx]}: {prob:.4f}")

Fine-tuning with Trainer

Basic Fine-tuning

python

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load data and model
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
)

# Tokenize dataset
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)

trainer.train()

Pushing to Hub

python

# Login first: huggingface-cli login

# Push model and tokenizer
model.push_to_hub("my-username/my-fine-tuned-model")
tokenizer.push_to_hub("my-username/my-fine-tuned-model")

# Or use trainer
trainer.push_to_hub()

See reference/fine-tuning.md for advanced patterns including LoRA, custom data collators, and evaluation metrics.

Working with Modalities

Vision Models

python

from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(-1).item()
    print(model.config.id2label[predicted_class])

Audio Models

python

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load audio (use librosa, soundfile, or datasets)
import librosa
audio, sr = librosa.load("audio.mp3", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

Vision-Language Models

python

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

model_name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

image = Image.open("image.jpg")
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(outputs[0], skip_special_tokens=True)

Memory and Performance

Quantization

python

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    load_in_8bit=True,
    device_map="auto",
)

Flash Attention

python

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Requires flash-attn package
    device_map="auto",
)

torch.compile

python

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = torch.compile(model, mode="reduce-overhead")

Batched Inference

python

texts = ["First prompt", "Second prompt", "Third prompt"]
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

See reference/advanced-inference.md for streaming, KV caching, and serving patterns.

Best Practices

Use bfloat16 over float16: Better numerical stability on modern GPUs
Set pad token for generation: tokenizer.pad_token = tokenizer.eos_token
Use device_map="auto": Let Accelerate handle device placement
Enable Flash Attention: Significant speedup for long sequences
Batch when possible: Amortize fixed costs across multiple inputs
Use pipeline for quick prototyping: Switch to manual control for production
Cache models locally: Set HF_HOME environment variable for model cache location
Check model license: Verify usage rights before deployment

References

See reference/ for detailed documentation:

fine-tuning.md - Advanced fine-tuning patterns with LoRA, PEFT, and custom training
advanced-inference.md - Generation strategies, streaming, and serving

Maintainer

itsmostafa Core maintainer

Source details

Full Name: itsmostafa/llm-engineering-skills
Branch: main
Path in repo: skills/transformers
License: MIT License
Topics: claude-code ai-agents codex-cli skills llm claude-skills ai-engineering generative-ai llm-engineering

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

itsmostafa/llm-engineering-skills

prompt-engineering

Crafting effective prompts for LLMs. Use when designing prompts, improving output quality, structuring complex instructions, or debugging poor model responses.

17 0

Explore

itsmostafa/llm-engineering-skills

qlora

Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds memory. Builds on the lora skill.

17 0

Explore

itsmostafa/llm-engineering-skills

agents

Patterns and architectures for building AI agents and workflows with LLMs. Use when designing systems that involve tool use, multi-step reasoning, autonomous decision-making, or orchestration of LLM-driven tasks.

17 0

Explore

itsmostafa/llm-engineering-skills

rlhf

Understanding Reinforcement Learning from Human Feedback (RLHF) for aligning language models. Use when learning about preference data, reward modeling, policy optimization, or direct alignment algorithms like DPO.

17 0

Explore

itsmostafa/llm-engineering-skills

pytorch

Building and training neural networks with PyTorch. Use when implementing deep learning models, training loops, data pipelines, model optimization with torch.compile, distributed training, or deploying PyTorch models.

17 0

Explore

itsmostafa/llm-engineering-skills

lora

Parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA). Use when fine-tuning large language models with limited GPU memory, creating task-specific adapters, or when you need to train multiple specialized models from a single base.

17 0

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Using Hugging Face Transformers

Table of Contents

Core Concepts

The Three Core Classes

The from_pretrained Pattern

Auto Classes

Pipeline API

Text Tasks

Chat/Conversational

Vision Tasks

Audio Tasks

Multimodal Tasks

Model Loading

Device Placement

Loading from Local Path

Trust Remote Code

Inference Patterns

Text Generation

Chat Templates

Getting Embeddings

Classification

Fine-tuning with Trainer

Basic Fine-tuning

Pushing to Hub

Working with Modalities

Vision Models

Audio Models

Vision-Language Models

Memory and Performance

Quantization

Flash Attention

torch.compile

Batched Inference

Best Practices

References

Recommended Agent Skills

prompt-engineering

qlora

agents

rlhf

pytorch

lora

The `from_pretrained` Pattern