Agent skill
audio-generation
Guide to audio generation and understanding in MassGen. Covers text-to-speech, music, sound effects, and audio understanding across ElevenLabs and OpenAI backends.
Install this agent skill to your Project
npx add-skill https://github.com/massgen/MassGen/tree/main/massgen/skills/audio-generation
Audio Generation
Generate audio using generate_media with mode="audio". Supports speech (TTS), music, and sound effects. ElevenLabs is preferred when available, with OpenAI as fallback.
Quick Start
# Text-to-speech (auto-selects ElevenLabs if key available)
generate_media(prompt="Hello, welcome to our presentation!", mode="audio")
# With specific voice
generate_media(prompt="Hello!", mode="audio", voice="Rachel")
# Music generation (ElevenLabs only)
generate_media(prompt="Upbeat jazz piano with soft drums", mode="audio",
               audio_type="music", duration=30)
# Sound effects (ElevenLabs only)
generate_media(prompt="Thunder rolling across a mountain valley", mode="audio",
               audio_type="sound_effect", duration=5)
Audio Types
| Type | Backends | Description |
|---|---|---|
| "speech" (default) | ElevenLabs, OpenAI | Text-to-speech with voice selection |
| "music" | ElevenLabs only | Music generation from text prompt |
| "sound_effect" | ElevenLabs only | Sound effect generation |
| "voice_conversion" | ElevenLabs only | Change voice of existing audio (speech-to-speech) |
| "audio_isolation" | ElevenLabs only | Remove background noise, isolate vocals |
| "voice_design" | ElevenLabs only | Create a new synthetic voice from text description |
| "voice_clone" | ElevenLabs only | Clone a voice from audio samples |
| "dubbing" | ElevenLabs only | Translate and dub audio to another language |
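The audio types table can be read as a lookup from audio type to the backends that support it. The following is a minimal sketch of that lookup, not MassGen's own code: `SUPPORTED_BACKENDS` and `backends_for` are hypothetical names, and the lowercase identifier `"elevenlabs"` is an assumption (only `"openai"` appears in this guide as a `backend_type` value).

```python
from typing import Dict, Tuple

# Hypothetical table mirroring the Audio Types section; not MassGen's own data.
# The "elevenlabs" identifier is an assumption made for this sketch.
SUPPORTED_BACKENDS: Dict[str, Tuple[str, ...]] = {
    "speech": ("elevenlabs", "openai"),
    "music": ("elevenlabs",),
    "sound_effect": ("elevenlabs",),
    "voice_conversion": ("elevenlabs",),
    "audio_isolation": ("elevenlabs",),
    "voice_design": ("elevenlabs",),
    "voice_clone": ("elevenlabs",),
    "dubbing": ("elevenlabs",),
}


def backends_for(audio_type: str = "speech") -> Tuple[str, ...]:
    """Return the backends that can handle audio_type, in priority order."""
    if audio_type not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown audio_type: {audio_type!r}")
    return SUPPORTED_BACKENDS[audio_type]
```

A check like this before calling `generate_media` makes it obvious when a request, such as music, cannot be served with only an OpenAI key.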
Backend Comparison
| Backend | Default Model | Supports | API Key |
|---|---|---|---|
| ElevenLabs (priority 1) | eleven_multilingual_v2 | Speech, music, SFX | ELEVENLABS_API_KEY |
| OpenAI (priority 2) | gpt-4o-mini-tts | Speech only | OPENAI_API_KEY |
If ElevenLabs TTS fails, the system automatically falls back to OpenAI TTS.
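The priority order and fallback described here can be pictured roughly as follows. This is a sketch under stated assumptions, not MassGen's implementation: `available_backends` and `speak_with_fallback` are hypothetical helpers, the backend identifiers are illustrative, and the synthesizer callables stand in for whatever each backend actually does.

```python
from typing import Callable, Dict, List, Mapping, Tuple

# Priority order and key names come from the Backend Comparison table.
BACKEND_KEYS: List[Tuple[str, str]] = [
    ("elevenlabs", "ELEVENLABS_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def available_backends(env: Mapping[str, str]) -> List[str]:
    """List backends whose API key is set and non-empty, highest priority first."""
    return [name for name, key in BACKEND_KEYS if env.get(key)]


def speak_with_fallback(
    text: str,
    synthesizers: Dict[str, Callable[[str], bytes]],
    env: Mapping[str, str],
) -> Tuple[str, bytes]:
    """Try each available backend in order; return (backend, audio) from the first success."""
    errors: Dict[str, Exception] = {}
    for name in available_backends(env):
        try:
            return name, synthesizers[name](text)
        except Exception as exc:  # any backend failure triggers the fallback
            errors[name] = exc
    raise RuntimeError(f"no backend produced audio: {errors}")
```

Passing `os.environ` as `env` reproduces the "auto-selects ElevenLabs if key available" behaviour: with both keys set, OpenAI is reached only when the ElevenLabs call raises.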
Key Parameters
| Parameter | Description | Example |
|---|---|---|
| prompt | Text to speak (speech) or description (music/SFX) | "Hello world!" |
| voice | Voice name or ID | "Rachel", "nova", "alloy" |
| audio_type | Type of audio | "speech", "music", "sound_effect" |
| duration | Length in seconds (music/SFX only) | 30 |
| instructions | Speaking style (OpenAI gpt-4o-mini-tts only) | "warm, reflective tone" |
| audio_format | Output format | "mp3", "wav", "opus" |
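Because `duration` applies to music and sound effects only, it can help to assemble the arguments in one place and catch a misplaced `duration` early. The helper below is a hypothetical sketch (`audio_kwargs` is not a MassGen function), and whether `generate_media` itself rejects or silently ignores `duration` for speech is not stated in this guide; raising an error is this sketch's own choice.

```python
from typing import Any, Dict, Optional

# Audio types that accept a duration, per the Key Parameters table.
TIMED_TYPES = {"music", "sound_effect"}


def audio_kwargs(
    prompt: str,
    audio_type: str = "speech",
    duration: Optional[int] = None,
    **extra: Any,
) -> Dict[str, Any]:
    """Assemble keyword arguments for generate_media, rejecting a misplaced duration."""
    if duration is not None and audio_type not in TIMED_TYPES:
        raise ValueError(f"duration applies to music/SFX only, not {audio_type!r}")
    kwargs: Dict[str, Any] = {"prompt": prompt, "mode": "audio", "audio_type": audio_type}
    if duration is not None:
        kwargs["duration"] = duration
    kwargs.update(extra)  # e.g. voice="Rachel" or audio_format="wav"
    return kwargs
```

The result can be unpacked into the call, for example `generate_media(**audio_kwargs("Upbeat jazz piano", "music", duration=30))`.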
Voice Quick Reference
ElevenLabs (top voices):
| Voice | Character |
|---|---|
| Rachel | Warm, conversational female |
| Sarah | Clear, professional female |
| Josh | Friendly male |
| Adam | Deep, authoritative male |
| Emily | Bright, energetic female |
OpenAI voices: alloy, echo, fable, onyx, nova, shimmer, coral, sage
Important: prompt vs instructions
For speech, prompt is the literal text to speak. Style guidance goes in instructions:
# CORRECT: prompt = text to speak, instructions = how to speak it
generate_media(
    prompt="Welcome to the annual report presentation.",
    mode="audio",
    voice="alloy",
    instructions="warm, reflective tone with measured pacing",
    backend_type="openai"
)
# WRONG: Don't put style instructions in prompt
generate_media(prompt="Say this warmly: Welcome...", mode="audio") # Bad!
The instructions parameter only works with OpenAI gpt-4o-mini-tts. With ElevenLabs, tone comes from voice selection, so pick a voice whose character matches the delivery you want.
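One way to keep the prompt/instructions split straight is to build the call arguments with a small helper that attaches `instructions` only when the OpenAI backend is requested explicitly. This is an illustrative sketch: `speech_kwargs` is a hypothetical name, and dropping the style for other backends is a design choice of the sketch rather than documented MassGen behaviour.

```python
from typing import Any, Dict, Optional


def speech_kwargs(
    text: str,
    voice: str,
    style: Optional[str] = None,
    backend_type: Optional[str] = None,
) -> Dict[str, Any]:
    """Build generate_media arguments for speech, keeping style out of the prompt."""
    # The prompt is always the literal text to speak, never style guidance.
    kwargs: Dict[str, Any] = {"prompt": text, "mode": "audio", "voice": voice}
    if backend_type is not None:
        kwargs["backend_type"] = backend_type
    # Style guidance is attached only for an explicit OpenAI backend;
    # for anything else it is dropped and the voice choice carries the tone.
    if style and backend_type == "openai":
        kwargs["instructions"] = style
    return kwargs
```

With this helper, `speech_kwargs("Welcome to the annual report presentation.", "alloy", "warm, reflective tone", "openai")` yields the same arguments as the CORRECT example, while the same call with voice "Rachel" and no `backend_type` omits `instructions` entirely.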
Audio Understanding
Use read_media (not generate_media) to analyze existing audio:
read_media(path="recording.mp3", prompt="Transcribe and summarize this audio")
Need More Control?
- Full ElevenLabs voice catalog (28+ voices): See references/voices.md
- Music and sound effects details: See references/music_and_sfx.md
- Advanced audio capabilities (voice conversion, cloning, isolation, dubbing): See references/advanced.md