Agent skill
tts
Text-to-speech and voice narration. Triggers on: "朗读这段", "配音", "TTS", "语音合成", "text to speech", "read this aloud", "convert to speech", "voice narration", "read aloud".
Install this agent skill to your Project
npx add-skill https://github.com/marswaveai/skills/tree/main/tts
Metadata
Additional technical details for this skill
- openclaw: { "emoji": "\ud83d\udd0a", "requires": { "env": [ "LISTENHUB_API_KEY" ] }, "primaryEnv": "LISTENHUB_API_KEY" }
SKILL.md
When to Use
- User wants to convert text to spoken audio
- User asks for "read aloud", "TTS", "text to speech", "voice narration"
- User says "朗读", "配音", "语音合成"
- User wants multi-speaker scripted audio or dialogue
When NOT to Use
- User wants a podcast-style discussion with topic exploration (use /podcast)
- User wants an explainer video with visuals (use /explainer)
- User wants to generate an image (use /image-gen)
Purpose
Convert text into natural-sounding speech audio. Two paths:
- Quick mode (/v1/tts): Single voice, low-latency, sync MP3 stream. For casual chat, reading snippets, instant audio.
- Script mode (/v1/speech): Multi-speaker, per-segment voice assignment. For dialogue, audiobooks, scripted content.
Hard Constraints
- No shell scripts. Construct curl commands from the files listed under API Reference
- Always read shared/authentication.md for API key and headers
- Follow shared/common-patterns.md for errors and interaction patterns
- Never hardcode speaker IDs in API calls: use built-in defaults from shared/speaker-selection.md as fallback only; fetch from the speakers API when the user wants to change voice
- Always read config following shared/config-pattern.md before any interaction
- Always follow shared/speaker-selection.md for speaker selection (text table + free-text input)
- Never save files to ~/Downloads/ or /tmp/ as primary output; save artifacts to the current working directory with friendly topic-based names (see shared/config-pattern.md § Artifact Naming)
Mode Detection
Determine the mode from the user's input automatically before asking any questions:
| Signal | Mode |
|---|---|
| "多角色", "脚本", "对话", "script", "dialogue", "multi-speaker" | Script |
| Multiple characters mentioned by name or role | Script |
| Input contains structured segments (A: ..., B: ...) | Script |
| Single paragraph of text, no character markers | Quick |
| "读一下", "read this", "TTS", "朗读" with plain text | Quick |
| Ambiguous | Quick (default) |
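The table above can be approximated with a small heuristic. This is only a sketch: `detect_mode` is a hypothetical helper, and the rule that two or more `Label: text` lines count as structured segments is an assumption. The "multiple characters mentioned by name or role" signal is not covered and is left to the agent's judgment.

```shell
# Hypothetical helper: classify a request as "script" or "quick".
detect_mode() {
  local input="$1" labelled

  # Chinese script-mode keywords from the table (substring match).
  case "$input" in
    *多角色*|*脚本*|*对话*) echo "script"; return ;;
  esac

  # English keywords, matched as whole words so "description" does not hit "script".
  if printf '%s\n' "$input" | grep -Eiqw 'script|dialogue|multi-speaker'; then
    echo "script"; return
  fi

  # Structured segments (assumption): at least two lines shaped like "Label: text".
  labelled=$(printf '%s\n' "$input" \
    | grep -Ec '^[[:alnum:]_ ]+:[[:space:]]*[^[:space:]]' || true)
  if [ "$labelled" -ge 2 ]; then
    echo "script"
  else
    echo "quick"   # plain text and ambiguous input default to Quick
  fi
}
```

A single leading label such as "TTS this: ..." stays in Quick mode because only one labelled line is present.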
Interaction Flow
Step -1: API Key Check
Follow shared/config-pattern.md § API Key Check. If the key is missing, stop immediately.
Step 0: Config Setup
Follow shared/config-pattern.md Step 0 (Zero-Question Boot).
If file doesn't exist — silently create with defaults and proceed:
mkdir -p ".listenhub/tts"
echo '{"outputMode":"inline","language":null,"defaultSpeakers":{}}' > ".listenhub/tts/config.json"
CONFIG_PATH=".listenhub/tts/config.json"
CONFIG=$(cat "$CONFIG_PATH")
Do NOT ask any setup questions. Proceed directly to the Interaction Flow.
If file exists — read config silently and proceed:
CONFIG_PATH=".listenhub/tts/config.json"
[ ! -f "$CONFIG_PATH" ] && CONFIG_PATH="$HOME/.listenhub/tts/config.json"
CONFIG=$(cat "$CONFIG_PATH")
Setup Flow (user-initiated reconfigure only)
Only run when the user explicitly asks to reconfigure. Display current settings:
当前配置 (tts):
输出方式:{inline / download / both}
语言偏好:{zh / en / 未设置}
默认主播:{speakerName / 使用内置默认}
Then ask:
- outputMode: Follow shared/output-mode.md § Setup Flow Question.
- Language (optional): "默认语言?"
  - "中文 (zh)"
  - "English (en)"
  - "每次手动选择" → keep null
After collecting answers, save immediately:
NEW_CONFIG=$(echo "$CONFIG" | jq --arg m "$OUTPUT_MODE" '. + {"outputMode": $m}')
# Save language if user chose one (not "每次手动选择")
if [ "$LANGUAGE" != "null" ]; then
NEW_CONFIG=$(echo "$NEW_CONFIG" | jq --arg lang "$LANGUAGE" '. + {"language": $lang}')
fi
echo "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")
Quick Mode — POST /v1/tts
Step 1: Extract text
Get the text to convert. If the user hasn't provided it, ask:
"What text would you like me to read aloud?"
Step 2: Determine voice
- If config.defaultSpeakers.{language}[0] is set → use it silently (skip to Step 4)
- If not set → use the built-in default from shared/speaker-selection.md for the detected language (skip to Step 4)
- Only show speaker selection if the user explicitly asks to change voice
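The lookup order in this step can be sketched with jq. `resolve_voice` is a hypothetical helper, and the fallback argument stands in for whatever built-in default shared/speaker-selection.md lists for the language; no real speaker ID is assumed here.

```shell
# Hypothetical helper: print the saved voice for a language, else the fallback.
# $1: config JSON, $2: language code, $3: built-in default (placeholder value)
resolve_voice() {
  local config_json="$1" language="$2" fallback="$3" saved
  # "// empty" yields no output when the key is missing or null.
  saved=$(printf '%s' "$config_json" \
    | jq -r --arg lang "$language" '.defaultSpeakers[$lang][0] // empty')
  printf '%s\n' "${saved:-$fallback}"
}
```

With the default config created in Step 0, `defaultSpeakers` is empty, so the fallback value is printed.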
Step 3: Save preference
After the user explicitly selects a new voice (not when using defaults):
Question: "Save {voice name} as your default voice for {language}?"
Options:
- "Yes" — update .listenhub/tts/config.json
- "No" — use for this session only
Step 4: Confirm
Ready to generate:
Text: "{first 80 chars}..."
Voice: {voice name}
Proceed?
Step 5: Generate
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-H "X-Source: skills" \
-d '{"input": "...", "voice": "..."}' \
--output /tmp/tts-output.mp3
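User text often contains quotes or newlines that would break a hand-written JSON body. One possible way to build the payload safely is to let jq do the escaping. `build_tts_payload` is a hypothetical helper; only the `input` and `voice` fields shown in the request above are assumed.

```shell
# Hypothetical helper: build the /v1/tts request body with proper JSON escaping.
# $1: text to speak, $2: voice (speaker ID resolved in Step 2)
build_tts_payload() {
  jq -n --arg input "$1" --arg voice "$2" '{input: $input, voice: $voice}'
}
```

The result can be passed to curl as `-d "$(build_tts_payload "$TEXT" "$VOICE")"` in place of the literal body.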
Step 6: Present result
Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.
Use a timestamped jobId: $(date +%s)
inline or both (TTS quick returns a sync audio stream — no audioUrl):
JOB_ID=$(date +%s)
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-H "X-Source: skills" \
-d '{"input": "...", "voice": "..."}' \
--output /tmp/tts-${JOB_ID}.mp3
Then use the Read tool on /tmp/tts-{jobId}.mp3.
Present:
Audio generated!
download or both: Generate a topic slug from the text content following shared/config-pattern.md § Artifact Naming.
SLUG="{topic-slug}" # e.g. "server-maintenance-notice"
NAME="${SLUG}.mp3"
# Dedup: if file exists, append -2, -3, etc.
BASE="${NAME%.*}"; EXT="${NAME##*.}"; i=2
while [ -e "$NAME" ]; do NAME="${BASE}-${i}.${EXT}"; i=$((i+1)); done
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-H "X-Source: skills" \
-d '{"input": "...", "voice": "..."}' \
--output "$NAME"
Present:
Audio generated!
已保存到当前目录:
{NAME}
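The topic slug itself is left to shared/config-pattern.md § Artifact Naming, which is not reproduced here. As a rough sketch under assumed rules (lowercase ASCII, hyphen-separated, at most 40 characters), a slug could be derived as follows. `slugify` and the `tts-audio` fallback name are hypothetical.

```shell
# Hypothetical helper: turn a short topic phrase into a filename-safe slug.
slugify() {
  local slug
  slug=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//' \
    | cut -c1-40)
  slug="${slug%-}"                   # cut may leave a trailing hyphen
  printf '%s\n' "${slug:-tts-audio}" # fallback when nothing ASCII survives
}
```

Text with no ASCII letters or digits (for example, a Chinese-only phrase) collapses to the fallback name, so a real implementation would likely need a translated or agent-chosen topic phrase as input.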
Script Mode — POST /v1/speech
Step 1: Get scripts
Determine whether the user already has a scripts array:
- Already provided (JSON or clear segments): parse and display for confirmation
- Not yet provided: help the user structure segments. Ask:
  "Please provide the script with speaker assignments. Format: each line as SpeakerName: text content. I'll convert it."
  Once the user provides the script, parse it into the scripts JSON format.
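Parsing `SpeakerName: text` lines into the request shape used in Step 5 might look like the sketch below. `parse_script` is a hypothetical helper; the name-to-speaker map is supplied by the caller after Step 2, and the IDs in any example call are placeholders. Lines that do not match the pattern are skipped, and a name missing from the map yields a null `speakerId`.

```shell
# Hypothetical helper: read "Name: text" lines on stdin, emit {"scripts": [...]}.
# $1: JSON object mapping character names to speaker IDs
parse_script() {
  jq -Rn --argjson voices "$1" '
    { scripts: [
        inputs
        | capture("^(?<name>[^:]+):\\s*(?<content>.+)$")
        | { content: .content, speakerId: $voices[.name] }
    ] }'
}
```

For example, `printf 'A: Welcome\nB: Thanks\n' | parse_script '{"A":"id-a","B":"id-b"}'` produces a two-segment body that can be written to the request file.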
Step 2: Assign voices per character
For each unique character in the script:
- If config.defaultSpeakers.{language} has saved voices → auto-assign silently (one per character in order)
- If not set → use built-in defaults from shared/speaker-selection.md (Primary for first character, Secondary for second)
- Only show speaker selection if the user explicitly asks to change voices
Step 3: Save preferences
After all voices are assigned (if any were new):
Question: "Save these voice assignments for future sessions?"
Options:
- "Yes" — update defaultSpeakers in .listenhub/tts/config.json
- "No" — use for this session only
Step 4: Confirm
Ready to generate:
Characters:
{name}: {voice}
{name}: {voice}
Segments: {count}
Title: (auto-generated)
Proceed?
Step 5: Generate
Write the request body to a temp file, then submit:
# Write request to temp file
cat > /tmp/lh-speech-request.json << 'ENDJSON'
{
"scripts": [
{"content": "...", "speakerId": "..."},
{"content": "...", "speakerId": "..."}
]
}
ENDJSON
# Submit
curl -sS -X POST "https://api.marswave.ai/openapi/v1/speech" \
-H "Authorization: Bearer $LISTENHUB_API_KEY" \
-H "Content-Type: application/json" \
-H "X-Source: skills" \
-d @/tmp/lh-speech-request.json
rm /tmp/lh-speech-request.json
Step 6: Present result
Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.
inline or both: Display the audioUrl and subtitlesUrl as clickable links.
Present:
Audio generated!
在线收听:{audioUrl}
字幕:{subtitlesUrl}
时长:{audioDuration / 1000}s
消耗积分:{credits}
download or both: Also download the file. Generate a topic slug following shared/config-pattern.md § Artifact Naming.
SLUG="{topic-slug}" # e.g. "welcome-dialogue"
NAME="${SLUG}.mp3"
# Dedup: if file exists, append -2, -3, etc.
BASE="${NAME%.*}"; EXT="${NAME##*.}"; i=2
while [ -e "$NAME" ]; do NAME="${BASE}-${i}.${EXT}"; i=$((i+1)); done
curl -sS -o "$NAME" "{audioUrl}"
Present:
已保存到当前目录:
{NAME}
Updating Config
When saving preferences, merge into .listenhub/tts/config.json — do not overwrite unchanged keys.
- Quick voice: set defaultSpeakers.{language}[0] to the selected speakerId
- Script voices: set defaultSpeakers.{language} to the full array assigned this session
- Language: set language if the user explicitly specifies it
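A merge that touches only one key can be sketched with a jq path assignment. `set_default_voice` is a hypothetical helper, and writing through a temp file is a design choice here, not something the skill prescribes; it avoids truncating the config if jq fails.

```shell
# Hypothetical helper: set defaultSpeakers.<language>[0] and keep all other keys.
# $1: config path, $2: language code, $3: speaker ID chosen by the user
set_default_voice() {
  local config_path="$1" language="$2" speaker_id="$3" tmp
  tmp=$(mktemp)
  jq --arg lang "$language" --arg id "$speaker_id" \
     '.defaultSpeakers[$lang][0] = $id' "$config_path" > "$tmp" \
    && mv "$tmp" "$config_path"
}
```

The path assignment creates the language array when it does not exist yet and leaves any later entries in an existing array in place.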
API Reference
- TTS & Speech endpoints: shared/api-tts.md
- Speaker list: shared/api-speakers.md
- Speaker selection guide: shared/speaker-selection.md
- Error handling: shared/common-patterns.md § Error Handling
- Long text input: shared/common-patterns.md § Long Text Input
Composability
- Invokes: speakers API (for speaker selection)
- Invoked by: explainer (for voiceover)
Examples
Quick mode:
"TTS this: The server will be down for maintenance at midnight."
- Detect: Quick mode (plain text, "TTS this")
- Read config: defaultSpeakers.{language}[0] is not set
- Fetch speakers, user picks "Yuanye"
- Ask to save → yes → update config
- POST /v1/tts with input + voice
- Present: /tmp/tts-output.mp3
Script mode:
"帮我做一段双人对话配音,A说:欢迎大家,B说:谢谢邀请"
- Detect: Script mode ("双人对话")
- Parse segments: A → "欢迎大家", B → "谢谢邀请"
- Read config: defaultSpeakers.zh is empty
- Fetch zh speakers, assign A and B voices
- Ask to save → yes → update config
- POST /v1/speech with scripts array
- Present: audioUrl, subtitlesUrl, duration