Agent skill
unsloth-vision
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/unsloth-vision
SKILL.md
Overview
Unsloth-vision provides optimized support for fine-tuning multimodal models like Llama 3.2 Vision and Qwen2.5 VL. It allows granular control over which layers (vision, language, or both) are updated and includes specialized data collators to handle image padding.
When to Use
- When adapting models for specialized visual tasks like medical imaging or OCR-to-code.
- When fine-tuning Llama 3.2 Vision models on consumer hardware.
- When needing to train specifically on assistant responses in a multimodal context.
Decision Tree
- Do you need to update visual feature extraction?
- Yes: Set
finetune_vision_layers = Trueinget_peft_model.
- Yes: Set
- Are your images varying in size?
- Yes: Standardize to 300-1000px and use
UnslothVisionDataCollator.
- Yes: Standardize to 300-1000px and use
- Should the model ignore system/user prompts during loss calculation?
- Yes: Use
train_on_responses_only = Truein the collator.
- Yes: Use
Workflows
- Vision Model Setup: Load models via
FastVisionModel.from_pretrainedand enable PEFT targetingall-linearmodules. - Multimodal Dataset Preparation: Format data as 'user'/'assistant' conversations with
{'type': 'image'}content and standardized dimensions (300-1000px). - Training for Response Accuracy: Initialize
UnslothVisionDataCollatorwithtrain_on_responses_only = Trueand specified chat template headers.
Non-Obvious Insights
- Unsloth allows selective fine-tuning of just the vision layers, just the language layers, or specific components like attention/MLP layers.
- The
UnslothVisionDataCollatorautomatically masks out padding vision tokens, which is essential for stabilizing loss during training. - Standardizing image resolution to the 300-1000px range is the optimal balance between detail preservation and VRAM efficiency.
Evidence
- "To finetune vision models, we now allow you to select which parts of the mode to finetune... You can select to only finetune the vision layers, or the language layers..." Source
- "It is best to ensure your dataset has images of all the same size/dimensions. Use dimensions of 300-1000px..." Source
Scripts
scripts/unsloth-vision_tool.py: Loading and configuring FastVisionModel.scripts/unsloth-vision_tool.js: Dataset formatter for multimodal conversations.
Dependencies
unslothpillow(for image processing)torch
References
- references/README.md
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?