Hugging Face Transformers Skill

Quick Reference

Task	Approach	Key Class
Text Generation	`pipeline("text-generation")`	`AutoModelForCausalLM`
Classification	`pipeline("text-classification")`	`AutoModelForSequenceClassification`
Embeddings	`sentence-transformers`	`SentenceTransformer`
NER	`pipeline("ner")`	`AutoModelForTokenClassification`
QA	`pipeline("question-answering")`	`AutoModelForQuestionAnswering`
Fine-tuning	`Trainer` API	`TrainingArguments`

Installation

bash

# Core transformers
pip install transformers torch

# With all extras
pip install transformers[torch] accelerate

# For embeddings
pip install sentence-transformers

# For quantization
pip install bitsandbytes

# For PEFT/LoRA
pip install peft

# For datasets
pip install datasets

Pipeline API (Fastest Start)

Text Generation

python

from transformers import pipeline

# Simple generation
generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")
result = generator("Hello, how are you?", max_length=50, num_return_sequences=1)
print(result[0]["generated_text"])

# With specific model for instruction following
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
    torch_dtype="auto"
)

messages = [{"role": "user", "content": "Explain transformers in 2 sentences"}]
response = generator(messages, max_new_tokens=100)

Text Classification

python

from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Multi-label classification
classifier = pipeline(
    "text-classification",
    model="facebook/bart-large-mnli",
    top_k=None  # Return all labels with scores
)

# Zero-shot classification
classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a tutorial about machine learning",
    candidate_labels=["education", "politics", "business"]
)

Named Entity Recognition

python

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
text = "Apple CEO Tim Cook announced new products in Cupertino"
entities = ner(text)

for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")
# Apple: ORG (0.99)
# Tim Cook: PER (0.99)
# Cupertino: LOC (0.98)

Question Answering

python

from transformers import pipeline

qa = pipeline("question-answering")
context = """
Hugging Face is a company that develops tools for building machine learning 
applications. It was founded in 2016 and is headquartered in New York City.
"""
question = "When was Hugging Face founded?"
result = qa(question=question, context=context)
# {'answer': '2016', 'score': 0.98, 'start': 89, 'end': 93}

Summarization

python

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = """Your long document text here..."""
summary = summarizer(text, max_length=130, min_length=30, do_sample=False)

Model and Tokenizer Loading

Basic Loading

python

from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

# Load tokenizer and model separately
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# For text generation models
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype="auto"
)

Loading with Options

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,      # Half precision
    device_map="auto",              # Automatic GPU placement
    low_cpu_mem_usage=True,         # Reduce RAM during loading
    trust_remote_code=True,         # For custom architectures
    attn_implementation="flash_attention_2"  # If available
)

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    padding_side="left",            # For batch generation
    use_fast=True                   # Use Rust tokenizer
)
tokenizer.pad_token = tokenizer.eos_token  # Set pad token

Offline Loading

python

# Download model for offline use
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
save_path = "./models/bert-base"

# Download and save
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
tokenizer.save_pretrained(save_path)
model.save_pretrained(save_path)

# Load offline
tokenizer = AutoTokenizer.from_pretrained(save_path, local_files_only=True)
model = AutoModel.from_pretrained(save_path, local_files_only=True)

Embeddings with Sentence Transformers

Basic Embeddings

python

from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Single text
embedding = model.encode("Hello, world!")
print(f"Dimension: {len(embedding)}")  # 384

# Batch encoding
sentences = ["First sentence", "Second sentence", "Third sentence"]
embeddings = model.encode(sentences, show_progress_bar=True)

Semantic Similarity

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

query = "How to learn Python?"
documents = [
    "Python tutorial for beginners",
    "Advanced JavaScript patterns",
    "Machine learning with Python",
    "Cooking recipes for dinner"
]

# Encode
query_embedding = model.encode(query, convert_to_tensor=True)
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# Calculate similarity
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

# Rank results
ranked = sorted(zip(documents, scores.tolist()), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}: {doc}")

Embedding Model Selection

Model	Dim	Use Case
`all-MiniLM-L6-v2`	384	Fast, general purpose
`all-mpnet-base-v2`	768	Higher quality, balanced
`bge-large-en-v1.5`	1024	State-of-the-art retrieval
`e5-large-v2`	1024	Multilingual support
`nomic-embed-text-v1`	768	Long context (8K tokens)

Optimized Embedding Generation

python

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

# Batch processing with optimal settings
embeddings = model.encode(
    large_text_list,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True  # For cosine similarity
)

Text Generation (Advanced)

Chat-Style Generation

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about coding"}
]

# Apply chat template
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

# Generate
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)

Streaming Generation

python

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)

inputs = tokenizer("Explain quantum computing:", return_tensors="pt").to(model.device)

generation_kwargs = {
    **inputs,
    "streamer": streamer,
    "max_new_tokens": 200,
    "do_sample": True,
    "temperature": 0.7
}

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for text in streamer:
    print(text, end="", flush=True)

Generation Parameters

python

outputs = model.generate(
    input_ids,
    # Length control
    max_new_tokens=256,
    min_new_tokens=10,
    
    # Sampling strategy
    do_sample=True,
    temperature=0.8,        # Randomness (0.0 = greedy)
    top_k=50,               # Top-k sampling
    top_p=0.95,             # Nucleus sampling
    
    # Repetition control
    repetition_penalty=1.1,
    no_repeat_ngram_size=3,
    
    # Beam search (alternative to sampling)
    # num_beams=4,
    # early_stopping=True,
    
    # Other
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

Quantization

BitsAndBytes 4-bit Quantization

python

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal float 4
    bnb_4bit_compute_dtype=torch.float16,    # Computation dtype
    bnb_4bit_use_double_quant=True           # Nested quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

8-bit Quantization

python

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

GPTQ Models (Pre-quantized)

python

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pre-quantized GPTQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GPTQ",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-v0.1-GPTQ")

Memory Comparison

Model (7B)	Precision	VRAM	Quality
Full	FP32	~28GB	100%
Half	FP16	~14GB	~99%
8-bit	INT8	~7GB	~97%
4-bit	NF4	~4GB	~95%

Fine-Tuning with Trainer

Basic Fine-Tuning

python

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

# Load model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"]
)

# Train
trainer.train()

Custom Metrics

python

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted")
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

PEFT and LoRA

LoRA Fine-Tuning

python

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_4bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Rank
    lora_alpha=32,                 # Alpha scaling
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%

Save and Load LoRA Adapters

python

# Save adapter only (small file)
model.save_pretrained("./lora-adapter")

# Load adapter onto base model
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge adapter into base model (optional)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")

QLoRA (Quantized LoRA)

python

from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)

Model Selection from Hub

Finding Models

python

from huggingface_hub import HfApi, list_models

api = HfApi()

# Search for models
models = list_models(
    filter="text-generation",
    sort="downloads",
    direction=-1,
    limit=10
)

for model in models:
    print(f"{model.id}: {model.downloads:,} downloads")

# Get model info
model_info = api.model_info("meta-llama/Llama-2-7b-hf")
print(f"Tags: {model_info.tags}")
print(f"License: {model_info.cardData.get('license', 'Unknown')}")

Recommended Models by Task

Task	Small (<3B)	Medium (3-13B)	Large (>13B)
Chat	TinyLlama-1.1B	Phi-3-mini-4k	Llama-3-70B
Code	CodeGemma-2B	CodeLlama-7B	DeepSeek-Coder-33B
Embeddings	all-MiniLM-L6-v2	bge-base-en-v1.5	bge-large-en-v1.5
Classification	DistilBERT	RoBERTa-base	DeBERTa-v3-large

Local Inference Optimization

Batch Processing

python

from transformers import pipeline
import torch

# Optimal batching for embeddings
pipe = pipeline(
    "feature-extraction",
    model="sentence-transformers/all-MiniLM-L6-v2",
    device=0 if torch.cuda.is_available() else -1
)

texts = ["text1", "text2", "text3", ...]
batch_size = 32
results = []

for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]
    embeddings = pipe(batch, truncation=True)
    results.extend(embeddings)

torch.compile for Speed

python

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Compile for faster inference (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")

KV-Cache and Attention Optimization

python

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map="auto",
    torch_dtype="auto",
    attn_implementation="flash_attention_2"  # Faster attention
)

# For multi-turn generation, use past_key_values
outputs = model.generate(
    input_ids,
    max_new_tokens=50,
    use_cache=True,  # Enable KV-cache
    return_dict_in_generate=True
)
past_key_values = outputs.past_key_values

# Continue generation with cached context
new_outputs = model.generate(
    new_input_ids,
    past_key_values=past_key_values,
    max_new_tokens=50
)

Memory-Efficient Inference

python

from transformers import AutoModelForCausalLM
import torch

# Gradient checkpointing for large models
model = AutoModelForCausalLM.from_pretrained(
    "large-model",
    device_map="auto",
    torch_dtype=torch.float16
)

# Enable for training (saves memory, slower)
model.gradient_checkpointing_enable()

# Inference with no_grad
with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=100)

Best Practices

1. Choose the Right Model Size

python

# Start small, scale up only if needed
MODELS_BY_USE_CASE = {
    "quick_prototype": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "production_chat": "microsoft/Phi-3-mini-4k-instruct",
    "code_generation": "codellama/CodeLlama-7b-hf",
    "embeddings": "sentence-transformers/all-MiniLM-L6-v2"
}

2. Always Set Device and Dtype

python

import torch
from transformers import AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map="auto" if device == "cuda" else None
)

3. Handle Tokenizer Edge Cases

python

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set pad token for batching
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# For left-padding in generation
tokenizer.padding_side = "left"

4. Error Handling for Generation

python

def safe_generate(model, tokenizer, prompt, **kwargs):
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=kwargs.get("max_new_tokens", 256),
                pad_token_id=tokenizer.eos_token_id,
                **kwargs
            )
        
        response = tokenizer.decode(
            outputs[0][inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        )
        return response.strip()
        
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        raise RuntimeError("GPU OOM - try smaller batch or quantized model")

5. Environment Configuration

python

import os

# Cache directory
os.environ["HF_HOME"] = "/path/to/cache"
os.environ["TRANSFORMERS_CACHE"] = "/path/to/models"

# Offline mode
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# Disable telemetry
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"

# Token for gated models
os.environ["HF_TOKEN"] = "your_token_here"

Common Patterns

Embedding Service

python

from sentence_transformers import SentenceTransformer
from functools import lru_cache
import numpy as np

class EmbeddingService:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
    
    def embed(self, texts: list[str], normalize: bool = True) -> np.ndarray:
        return self.model.encode(
            texts,
            normalize_embeddings=normalize,
            show_progress_bar=len(texts) > 100
        )
    
    @lru_cache(maxsize=10000)
    def embed_cached(self, text: str) -> tuple:
        return tuple(self.embed([text])[0])

LLM Wrapper

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

class LocalLLM:
    def __init__(self, model_id: str, quantize: bool = True):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        
        load_kwargs = {"device_map": "auto", "torch_dtype": torch.float16}
        if quantize:
            from transformers import BitsAndBytesConfig
            load_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
        
        self.model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
        
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
    
    def generate(self, prompt: str, max_tokens: int = 256, **kwargs) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                pad_token_id=self.tokenizer.eos_token_id,
                **kwargs
            )
        
        return self.tokenizer.decode(
            outputs[0][inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        )

Troubleshooting

Issue	Solution
CUDA OOM	Use 4-bit quantization or smaller model
Slow generation	Enable `use_cache=True`, use Flash Attention
Truncated output	Increase `max_new_tokens`
Repetitive text	Set `repetition_penalty=1.1`
Model not found	Check `HF_TOKEN` for gated models
Wrong device	Explicitly set `device_map="auto"`

Search AI Tools

Install this agent skill to your Project

SKILL.md

Hugging Face Transformers Skill

Quick Reference

Installation

Pipeline API (Fastest Start)

Text Generation

Text Classification

Named Entity Recognition

Question Answering

Summarization

Model and Tokenizer Loading

Basic Loading

Loading with Options

Offline Loading

Embeddings with Sentence Transformers

Basic Embeddings

Semantic Similarity

Embedding Model Selection

Optimized Embedding Generation

Text Generation (Advanced)

Chat-Style Generation

Streaming Generation

Generation Parameters

Quantization

BitsAndBytes 4-bit Quantization

8-bit Quantization

GPTQ Models (Pre-quantized)

Memory Comparison

Fine-Tuning with Trainer

Basic Fine-Tuning

Custom Metrics

PEFT and LoRA

LoRA Fine-Tuning

Save and Load LoRA Adapters

QLoRA (Quantized LoRA)

Model Selection from Hub

Finding Models

Recommended Models by Task

Local Inference Optimization

Batch Processing

torch.compile for Speed

KV-Cache and Attention Optimization

Memory-Efficient Inference

Best Practices

1. Choose the Right Model Size

2. Always Set Device and Dtype

3. Handle Tokenizer Edge Cases

4. Error Handling for Generation

5. Environment Configuration

Common Patterns

Embedding Service

LLM Wrapper

Troubleshooting