Agent skills
multi-dataset-unification-pipe...

Agent skill

multi-dataset-unification-pipeline

When the user requests to find, analyze, and merge multiple datasets from Hugging Face or similar repositories into a unified format. This skill handles the complete pipeline searching for datasets, examining their structure, parsing format specifications, converting diverse dataset formats (ToolACE, Glaive, XLAM, etc.) into a standardized schema, and merging them into a single output file. Key triggers include requests involving 'dataset unification', 'format conversion', 'merge datasets', 'Hugging Face datasets', 'tool-calling datasets', 'JSONL output', or when users provide a format specification document like 'unified_format.md'.

View SKILL.md on GitHub Repository

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/multi-dataset-unification-pipeline

SKILL.md

Instructions

1. Understand the Request

Identify the target datasets (names, IDs, or search queries).
Locate and read any provided format specification (e.g., unified_format.md).
Clarify the scope: number of entries per dataset, output file name/location.

2. Search and Validate Datasets

Use huggingface-dataset_search to find each dataset.
Use huggingface-hub_repo_details to get metadata and confirm availability.
Note the dataset size and structure.

3. Analyze Dataset Structures

Load a sample from each dataset using datasets.load_dataset.
Examine the column names and a few sample entries.
Identify the native format (e.g., ToolACE uses system/conversations, Glaive uses conversations/tools, XLAM uses query/answers/tools).

4. Convert to Unified Format

Always refer to the provided unified_format.md (or equivalent) for the target schema.
Key Rules:
- conversation_id: Format as {source_short_name}_{index} (e.g., toolace_0).
- messages: List of messages with role (user/assistant/tool), content, and optionally tool_calls or tool_call_id.
- tool_calls: For assistant messages, include id (format tool_call_{n}), name, arguments.
- tools: List of normalized tool definitions.
- Remove system messages if they only contain tool instructions.
- If assistant only made tool calls, set content to null.
- Do not add tool return results or assistant replies not in the original data.
Use the bundled scripts/convert_and_merge.py for reliable, deterministic conversion of ToolACE, Glaive, and XLAM formats.
For new dataset formats, write a custom converter following the patterns in the script.

5. Merge and Output

Limit entries per dataset as requested (default: first 500).
Merge all converted entries into a single list.
Write to a JSONL file in the workspace (default: unified_tool_call.jsonl).
Verify the output: count entries, check file size, validate format against the specification.

6. Finalize

Provide a summary: datasets found, entries converted, output location.
Optionally, show a sample entry from each source for validation.
Confirm the task is complete.

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/multi-dataset-unification-pipeline
License: MIT License

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Instructions

1. Understand the Request

2. Search and Validate Datasets

3. Analyze Dataset Structures

4. Convert to Unified Format

5. Merge and Output

6. Finalize

Recommended Agent Skills

agent-ops-spec

agent-ops-state

agent-ops-spec

agent-ops-testing

agent-ops-testing

agent-ops-state