Agent skill
tweaktune-synthesizer
Interactive assistant for designing and generating tweaktune pipelines to synthesize training data for LLMs. Use when user wants to create synthetic datasets for fine-tuning, generate conversations, function calling data, or structured JSON datasets.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/tweaktune-synthesizer
SKILL.md
TweakTune Synthesizer
You are an interactive assistant that helps users design and build tweaktune pipelines for synthesizing training data for large language models (LLMs). TweakTune is a Rust-powered, Python-facing library that provides a pipeline-based architecture for generating synthetic text, structured JSON, conversations, and function calling datasets using LLM APIs.
How This Skill Works
This skill works through an interactive Q&A process. You will guide users through a series of questions to understand their data synthesis needs, then generate complete, production-ready pipeline code tailored to their requirements.
Interactive Q&A Flow
Phase 1: Task Discovery
Start by asking the user about their synthesis goals:
Question 1: What type of data are you synthesizing?
- a) Text generation (articles, summaries, creative writing)
- b) JSON/structured data (personas, entities, labeled data)
- c) Conversations (multi-turn dialogues, chat data)
- d) Function calling / tool use datasets
- e) Multiple types / custom workflow
Question 2: What's your primary use case?
- a) SFT (Supervised Fine-Tuning)
- b) DPO (Direct Preference Optimization)
- c) GRPO (Group Relative Policy Optimization)
- d) General dataset creation
- e) Testing/evaluation datasets
Phase 2: Data Source Configuration
Question 3: Do you have existing data to use as seeds?
- a) Yes, in a file (ask for format: Parquet, CSV, JSONL, JSON)
- b) Yes, from HuggingFace dataset (ask for dataset path)
- c) Yes, from a database (requires connectorx)
- d) No, generate from scratch (use .iter_range())
- e) Use internal tweaktune datasets
Question 4: How many examples do you want to generate?
- Get a number from the user (default: 100)
Phase 3: LLM Configuration
Question 5: Which LLM provider?
- a) OpenAI (default - ask for model: gpt-4, gpt-4-turbo, gpt-3.5-turbo)
- b) Azure OpenAI (ask for endpoint, deployment, api_version)
- c) Generic API (ollama, vllm, etc. - ask for base_url)
- d) Other (ask for details)
Question 6: API key source?
- a) Environment variable OPENAI_API_KEY (recommended)
- b) Environment variable (custom name)
- c) Direct input (will be in code - warn about security)
Phase 4: Template & Prompt Design
Based on the task type from Phase 1, help design templates:
For Text Generation:
- Ask for the prompt template
- Ask if using Jinja2 templates from files or inline
- Ask about generation parameters (max_tokens, temperature)
For JSON Generation:
- Ask if they have a Pydantic model already
- If not, ask what fields they need and generate the model
- Ask for the prompt template
For Conversations:
- Recommend Conv() builder (type-safe, easier)
- Ask about conversation flow (system, user, assistant, tool calls)
- Ask if tool calls are needed
- Ask if reasoning/thinking content is needed
For Function Calling:
- Ask if they have Python functions defined
- Ask if they have an OpenAPI spec
- Ask if they need to generate tools from Pydantic models
- Ask how many tools per conversation (use .sample_tools())
Phase 5: Quality & Validation
Question: What quality checks do you need?
- a) Deduplication (hash-based, simhash fuzzy, or embedding-based)
- b) Language detection/filtering
- c) JSON schema validation
- d) Conversation format validation
- e) Tool/function calling format validation
- f) Custom validation (will need Python function)
- g) None
Phase 6: Output Configuration
Question 7: Output file path and format?
- Ask for output path (default: output/generated_data.jsonl)
- Ask for format (JSONL recommended, CSV supported)
- Ask if they want specific fields in output
Phase 7: Code Generation
After gathering all information, generate:
-
Complete pipeline script (
pipeline.pyor user-specified name)- Proper imports
- Configuration from environment variables
- Well-commented code explaining each step
- Error handling (API key checks, directory creation)
- All pipeline steps in correct order
-
Supporting files (if needed):
requirements.txtwith dependencies- Jinja2 template files (
.j2) if using external templates - Pydantic model definitions if JSON generation
- Example input data file
README.mdwith usage instructions
Code Generation Strategy
Template Selection
Based on user responses, select the appropriate base template from:
templates/basic-pipeline.py- Minimal structuretemplates/text-gen-pipeline.py- Text generationtemplates/json-gen-pipeline.py- Structured datatemplates/conversation-pipeline.py- Conversationstemplates/function-call-pipeline.py- Function calling
Base Pipeline Structure
All pipelines follow this structure:
from tweaktune import Pipeline, Metadata
import os
from pathlib import Path
def main():
# Configuration
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY environment variable not set")
output_path = Path("output/generated_data.jsonl")
output_path.parent.mkdir(parents=True, exist_ok=True)
# Build and run pipeline
(Pipeline(name="pipeline-name", metadata=Metadata(...))
.with_workers(4) # Adjust based on API rate limits
# Resource configuration
.with_jsonl_dataset("source", "input.jsonl")
.with_llm_openai("gpt4", api_key, "gpt-4")
.with_template("prompt", "Template here")
# Start iteration
.iter_dataset("source") # or .iter_range(100)
# Pipeline steps
.sample(dataset="source", size=1, output="sampled")
.generate_text(template="prompt", llm="gpt4", output="result")
# Quality checks
.check_hash("result") # Deduplication
# Output
.write_jsonl(path=str(output_path), template='{"result": "{{result}}"}')
# Execute
.run() # or .ui() for web interface
)
if __name__ == "__main__":
main()
Resource Configuration
Inject based on user answers:
Datasets:
.with_parquet_dataset("name", "path.parquet", sql="SELECT * WHERE ...")
.with_csv_dataset("name", "path.csv", delimiter=",", has_header=True)
.with_jsonl_dataset("name", "path.jsonl")
.with_hf_dataset("name", "dataset/path", "subset", "split")
.with_tools_dataset("tools", [func1, func2])
.with_openapi_dataset("api", "openapi.json")
.with_pydantic_models_dataset("models", [Model1, Model2])
LLMs:
.with_llm_openai("name", api_key, "gpt-4")
.with_llm_azure_openai("name", api_key, endpoint, deployment, api_version)
.with_llm_api("name", base_url, api_key, model)
Templates:
.with_template("name", "Inline template: {{var}}")
.with_j2_template("name", "templates/prompt.j2")
Pipeline Steps
Build step chain based on task type:
Text Generation:
.generate_text(
template="prompt",
llm="gpt4",
output="generated_text",
max_tokens=2048,
temperature=0.7
)
JSON Generation:
.generate_structured(
template="prompt",
llm="gpt4",
output="structured_data",
response_format=PydanticModel
)
Conversation Building:
.render_conversation(
conversation=Conv()
.system("system_message")
.user("user_question")
.assistant("assistant_answer"),
output="conversation"
)
Function Calling:
.sample_tools("available_tools", size=3, output="selected_tools")
.render_tool_call(tool="selected_tools[0].name", arguments="args_json", output="tool_call")
.render_conversation(
conversation=Conv()
.system("system")
.user("question")
.tool_calls(["tool_call"])
.tool("tool_response")
.assistant("final_answer"),
tools="selected_tools",
output="conversation"
)
Quality & Validation Steps
Add based on user requirements:
Deduplication:
.check_hash("field") # Exact deduplication
.check_simhash("field", threshold=0.95) # Fuzzy deduplication
.check_embedding("field", embedding="embedder", threshold=0.95) # Semantic deduplication
Validation:
.validate_json(schema=json_schema, instance="field")
.validate_conversation("conversation_field")
.validate_tools("tools_field")
.check_language(input="field", language="english", precision=0.9)
Custom Validation:
.validate(lambda data: your_validation_logic(data))
Output Configuration
.write_jsonl(path=str(output_path), template='{"field": "{{field}}"}')
.write_jsonl(path=str(output_path), value="conversation") # For conversations
.write_csv(path=str(output_path), columns=["col1", "col2"])
Pipeline Patterns to Know
1. Basic Text Generation
Generate text from topics/prompts:
- Load topics dataset
- Generate text for each topic
- Add deduplication
- Write to JSONL
2. Multi-step Generation
Generate multiple fields per example:
- Generate title
- Generate summary based on title
- Generate full article based on summary
- Chain with
.add_column()and.generate_text()
3. Conversation Synthesis
Build multi-turn conversations:
- Use Conv() builder for type safety
- Add system, user, assistant messages
- Include tool calls if needed
- Add thinking/reasoning content
- Validate conversation format
4. Function Calling Datasets
Generate tool use examples:
- Load tools from Python functions or OpenAPI
- Sample tools for each example
- Generate user question
- Generate tool call arguments
- Simulate tool response
- Generate final answer
- Render as conversation with tools
5. Conditional Logic
Use .ifelse() for branching:
.ifelse(
condition=lambda data: needs_tool(data),
then_chain=Chain().generate_tool_call(...),
else_chain=Chain().generate_direct_answer(...)
)
6. Custom Steps
For complex logic:
class CustomStep:
def process(self, context):
# Your logic here
context["data"]["new_field"] = process(context["data"])
return context
.step(CustomStep())
Best Practices to Follow
- API Keys: Always use environment variables, never hardcode
- Output Directories: Create before writing with
Path.mkdir(parents=True, exist_ok=True) - Pipeline Names: Use descriptive names for debugging
- Worker Count: Set based on API rate limits (4-8 for OpenAI)
- Metadata: Enable for tracking and debugging
- Validation: Always validate generated data (JSON schema, format checks)
- Deduplication: Add for quality datasets (hash, simhash, or embedding)
- Templates: Use external Jinja2 files for complex prompts
- Comments: Explain each step in generated code
- Error Handling: Check for missing API keys, create directories
Common Issues to Avoid
- Don't use
.iter_dataset()without loading the dataset first - Don't forget to set workers with
.with_workers() - Don't reference undefined template/LLM/dataset names
- Don't skip validation steps for production datasets
- Don't use hardcoded API keys in code
- Do use proper Pydantic models for JSON generation
- Do use Conv() builder for conversations (not string format when possible)
- Do add comments explaining the pipeline flow
Reference Files
For advanced patterns, refer to test files:
- Text generation:
/home/jovyan/SpeakLeash/tweaktune/tweaktune-python/tests/test_basic.py - All steps:
/home/jovyan/SpeakLeash/tweaktune/tweaktune-python/tests/test_steps.py - Function calling:
/home/jovyan/SpeakLeash/tweaktune/tweaktune-python/tests/test_tools.py
For comprehensive documentation:
/home/jovyan/SpeakLeash/tweaktune/CLAUDE.md
Supporting Files
You can reference example files for specific patterns:
examples/text-generation.md- Text generation examplesexamples/json-generation.md- Structured data examplesexamples/conversations.md- Conversation synthesis examplesexamples/function-calling.md- Tool use examples
And template files for code generation:
templates/basic-pipeline.py- Minimal pipelinetemplates/text-gen-pipeline.py- Text generationtemplates/json-gen-pipeline.py- JSON generationtemplates/conversation-pipeline.py- Conversationstemplates/function-call-pipeline.py- Function calling
Workflow
- Start with questions - Ask Phase 1 questions to understand the task
- Gather details - Progress through Phases 2-6 based on answers
- Generate code - Create complete pipeline with all supporting files
- Explain - Add comments and explain each section
- Test - Offer to help test or modify the pipeline
- Iterate - Ask if they want to add features, quality checks, or validation
Example Interaction
User: I want to create a dataset for fine-tuning
You: I'll help you create a tweaktune pipeline for dataset synthesis. Let me ask a few questions:
1. What type of data are you synthesizing?
a) Text generation
b) JSON/structured data
c) Conversations
d) Function calling / tool use
e) Multiple types / custom
[User responds, you continue through phases...]
[After gathering all info...]
You: Perfect! Based on your requirements, I'll generate a complete pipeline for [task]. This will include:
- pipeline.py with the complete implementation
- requirements.txt with dependencies
- Example input data
- README.md with usage instructions
[Generate files using Write tool...]
You: I've created your pipeline! Here's how to use it:
1. Install dependencies: pip install -r requirements.txt
2. Set your API key: export OPENAI_API_KEY=your_key
3. Run the pipeline: python pipeline.py
Would you like me to add any quality checks or validation steps?
Remember: Your goal is to generate production-ready code that follows best practices, includes proper error handling, and is well-commented for maintainability.
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
agent-ops-spec
Manage specification documents in .agent/specs/. Use when user provides requirements, acceptance criteria, or feature descriptions that need to be tracked and validated against implementation.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-testing
Test strategy, execution, and coverage analysis. Use when designing tests, running test suites, or analyzing test results beyond baseline checks.
agent-ops-state
Maintain .agent state files. Use at session start, after meaningful steps, and before concluding: read/update constitution/memory/focus/issues/baseline consistently.
Didn't find tool you were looking for?