Agent skill
unsloth-datasets
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/unsloth-datasets
SKILL.md
Overview
Unsloth-datasets provides tools to prepare and optimize data for fine-tuning. Key features include standardizing external datasets (like ShareGPT/Alpaca) into a unified format, synthetically extending single-turn data into multi-turn conversations, and handling custom special tokens.
When to Use
- When converting raw datasets from diverse sources (Hugging Face, ShareGPT) into Unsloth-compatible formats.
- When you only have single-turn data but want the model to learn multi-turn conversation logic.
- When adding new domain-specific tokens (e.g.,
<THINKING>) to a model.
Decision Tree
- Is your dataset in ShareGPT format?
- Yes: Use
standardize_sharegpt().
- Yes: Use
- Do you have only single-turn data but want multi-turn performance?
- Yes: Use the
conversation_extensionparameter.
- Yes: Use the
- Are you adding new tokens?
- Yes: Call
add_new_tokens()BEFORE callingget_peft_model().
- Yes: Call
Workflows
- Standardizing External Datasets: Import
standardize_sharegpt, apply it to the dataset to map roles (e.g., 'human/gpt'), and apply the chat template usingformatting_prompts_func. - Adding Custom Domain Tokens: Load model/tokenizer, use
add_new_tokensto update matrices, and THEN initialize PEFT adapters to ensure new weights are covered. - Custom Chat Template Design: Define a Jinja2 template and explicit EOS token, then pass them as a tuple to
get_chat_template.
Non-Obvious Insights
- Applying the wrong chat template (e.g., Llama-3 template on a Mistral model) is a leading cause of poor fine-tuning performance.
- The
conversation_extensiontool creates synthetic multi-turn interactions by randomly merging single-turn rows, improving the model's contextual memory. - The order of operations is critical: Adding tokens after initializing LoRA will result in new tokens not being trained by the adapters.
Evidence
- "We introduced the conversation_extension parameter, which essentially selects some random rows in your single turn dataset, and merges them into 1 conversation!" Source
- "Users must call add_new_tokens BEFORE get_peft_model to properly resize embedding matrices and LoRA adapters." Source
Scripts
scripts/unsloth-datasets_tool.py: Python tool for standardization and token addition.scripts/unsloth-datasets_tool.js: JS template for ShareGPT data mapping.
Dependencies
unslothdatasetsjinja2
References
- references/README.md
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?