Agent skill
data-designer
Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when users ask to create synthetic data, generate datasets, create fake/mock data, generate test data, training data, or any data generation task. Supports CSV, JSON, JSONL, Parquet output. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/data-designer
SKILL.md
Data Designer
Generate synthetic datasets combining statistical samplers with Claude's LLM capabilities. No external API keys required.
Workflow
- Clarify requirements - Ask about purpose, columns, size, format
- Create schema - Write
dataset_schema.jsondefining columns - Generate preview - Run
batch_generator.pyfor 3-5 rows - Iterate - Refine based on feedback
- Generate full dataset - Batch generate, then merge
- Deliver - Export to requested format
Column Types
Statistical Samplers (No LLM)
| Type | Description | Key Params |
|---|---|---|
category |
Weighted random choice | values, weights |
subcategory |
Hierarchical (parent-based) | mapping, category |
uniform |
Uniform distribution | low, high, dtype |
gaussian |
Normal distribution | mean, std, min_val, max_val |
bernoulli |
Binary probability | p, true_value, false_value |
poisson |
Poisson distribution | mean |
datetime |
Random dates | start, end, format |
person |
Synthetic personas | fields, age_range, locale |
uuid |
Unique IDs | prefix, format |
LLM Columns (Claude generates)
| Type | Description |
|---|---|
llm_text |
Free-form text |
llm_code |
Code with syntax validation |
llm_structured |
JSON matching schema |
llm_judge |
Quality scoring |
Schema Format
Create dataset_schema.json:
{
"name": "dataset_name",
"seed": 42,
"columns": [
{"name": "category", "type": "category", "params": {"values": ["A","B"], "weights": [0.6,0.4]}},
{"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
],
"output": {"format": "csv", "filename": "output"}
}
For full schema reference: references/schema.md
Jinja2 Templating
Reference columns in prompts:
Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.
Supports: {{ var }}, {{ obj.field }}, {% if %}, filters
Scripts
Generate Data
# Preview
python scripts/batch_generator.py --schema schema.json --rows 5 --output preview.json --preview
# Full generation
python scripts/batch_generator.py --schema schema.json --rows 100 --batch-size 20 --output batches/
Merge & Export
python scripts/merger.py --input batches/ --output dataset.csv --flatten
Formats: csv, json, jsonl, parquet
Generation Strategy
- Sampler columns first - Python scripts, fast
- LLM columns in dependency order - Topological sort by
depends_on - Batch processing - Generate in batches of 20-50 for large datasets
For LLM columns, Claude generates directly:
- Render Jinja2 prompt with row data
- Generate content
- Validate if configured
- Retry on failure (max 3)
Examples
Simple:
"Generate 50 product reviews with ratings 1-5"
Complex:
"Create 200 support tickets with: ticket_id (UUID), customer (name, email), category (billing/technical/general), priority (1-5 gaussian), description (LLM)"
Code:
"Generate 100 Python functions with description, code (validated), tests"
Tips
- Use
seedfor reproducibility - Preview first, then scale
- Keep LLM prompts specific
- Use
subcategoryfor correlated data
Attribution
Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?