Agent skill

unsloth-datasets

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/unsloth-datasets

SKILL.md

Overview

Unsloth-datasets provides tools to prepare and optimize data for fine-tuning. Key features include standardizing external datasets (like ShareGPT/Alpaca) into a unified format, synthetically extending single-turn data into multi-turn conversations, and handling custom special tokens.

When to Use

When converting raw datasets from diverse sources (Hugging Face, ShareGPT) into Unsloth-compatible formats.
When you only have single-turn data but want the model to learn multi-turn conversation logic.
When adding new domain-specific tokens (e.g., <THINKING>) to a model.

Decision Tree

Is your dataset in ShareGPT format?
- Yes: Use standardize_sharegpt().
Do you have only single-turn data but want multi-turn performance?
- Yes: Use the conversation_extension parameter.
Are you adding new tokens?
- Yes: Call add_new_tokens() BEFORE calling get_peft_model().

Workflows

Standardizing External Datasets: Import standardize_sharegpt, apply it to the dataset to map roles (e.g., 'human/gpt'), and apply the chat template using formatting_prompts_func.
Adding Custom Domain Tokens: Load model/tokenizer, use add_new_tokens to update matrices, and THEN initialize PEFT adapters to ensure new weights are covered.
Custom Chat Template Design: Define a Jinja2 template and explicit EOS token, then pass them as a tuple to get_chat_template.

Non-Obvious Insights

Applying the wrong chat template (e.g., Llama-3 template on a Mistral model) is a leading cause of poor fine-tuning performance.
The conversation_extension tool creates synthetic multi-turn interactions by randomly merging single-turn rows, improving the model's contextual memory.
The order of operations is critical: Adding tokens after initializing LoRA will result in new tokens not being trained by the adapters.

Evidence

"We introduced the conversation_extension parameter, which essentially selects some random rows in your single turn dataset, and merges them into 1 conversation!" Source
"Users must call add_new_tokens BEFORE get_peft_model to properly resize embedding matrices and LoRA adapters." Source

Scripts

scripts/unsloth-datasets_tool.py: Python tool for standardization and token addition.
scripts/unsloth-datasets_tool.js: JS template for ShareGPT data mapping.

Dependencies

unsloth
datasets
jinja2

References

references/README.md

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/unsloth-datasets
License: MIT License

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Overview

When to Use

Decision Tree

Workflows

Non-Obvious Insights

Evidence

Scripts

Dependencies

References

Recommended Agent Skills

agent-ops-spec

agent-ops-state

agent-ops-spec

agent-ops-testing

agent-ops-testing

agent-ops-state