Agent skill

data-designer

Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when users ask to create synthetic data, generate datasets, create fake/mock data, generate test data, training data, or any data generation task. Supports CSV, JSON, JSONL, Parquet output. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).

View SKILL.md on GitHub Repository

Stars 163

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/data-designer

SKILL.md

Data Designer

Generate synthetic datasets combining statistical samplers with Claude's LLM capabilities. No external API keys required.

Workflow

Clarify requirements - Ask about purpose, columns, size, format
Create schema - Write dataset_schema.json defining columns
Generate preview - Run batch_generator.py for 3-5 rows
Iterate - Refine based on feedback
Generate full dataset - Batch generate, then merge
Deliver - Export to requested format

Column Types

Statistical Samplers (No LLM)

Type	Description	Key Params
`category`	Weighted random choice	`values`, `weights`
`subcategory`	Hierarchical (parent-based)	`mapping`, `category`
`uniform`	Uniform distribution	`low`, `high`, `dtype`
`gaussian`	Normal distribution	`mean`, `std`, `min_val`, `max_val`
`bernoulli`	Binary probability	`p`, `true_value`, `false_value`
`poisson`	Poisson distribution	`mean`
`datetime`	Random dates	`start`, `end`, `format`
`person`	Synthetic personas	`fields`, `age_range`, `locale`
`uuid`	Unique IDs	`prefix`, `format`

LLM Columns (Claude generates)

Type	Description
`llm_text`	Free-form text
`llm_code`	Code with syntax validation
`llm_structured`	JSON matching schema
`llm_judge`	Quality scoring

Schema Format

Create dataset_schema.json:

json

{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category", "params": {"values": ["A","B"], "weights": [0.6,0.4]}},
    {"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}

For full schema reference: references/schema.md

Jinja2 Templating

Reference columns in prompts:

Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.

Supports: {{ var }}, {{ obj.field }}, {% if %}, filters

Scripts

Generate Data

bash

# Preview
python scripts/batch_generator.py --schema schema.json --rows 5 --output preview.json --preview

# Full generation
python scripts/batch_generator.py --schema schema.json --rows 100 --batch-size 20 --output batches/

Merge & Export

bash

python scripts/merger.py --input batches/ --output dataset.csv --flatten

Formats: csv, json, jsonl, parquet

Generation Strategy

Sampler columns first - Python scripts, fast
LLM columns in dependency order - Topological sort by depends_on
Batch processing - Generate in batches of 20-50 for large datasets

For LLM columns, Claude generates directly:

Render Jinja2 prompt with row data
Generate content
Validate if configured
Retry on failure (max 3)

Examples

Simple:

"Generate 50 product reviews with ratings 1-5"

Complex:

"Create 200 support tickets with: ticket_id (UUID), customer (name, email), category (billing/technical/general), priority (1-5 gaussian), description (LLM)"

Code:

"Generate 100 Python functions with description, code (validated), tests"

Tips

Use seed for reproducibility
Preview first, then scale
Keep LLM prompts specific
Use subcategory for correlated data

Attribution

Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/data-designer
License: MIT License

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-spec

Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-testing

Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.

163 31

Explore

majiayu000/claude-skill-registry

agent-ops-state

Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.

163 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Data Designer

Workflow

Column Types

Statistical Samplers (No LLM)

LLM Columns (Claude generates)

Schema Format

Jinja2 Templating

Scripts

Generate Data

Merge & Export

Generation Strategy

Examples

Tips

Attribution

Recommended Agent Skills

agent-ops-spec

agent-ops-state

agent-ops-spec

agent-ops-testing

agent-ops-testing

agent-ops-state