Agent skill

tts

Text-to-speech and voice narration. Triggers on: "朗读这段", "配音", "TTS", "语音合成", "text to speech", "read this aloud", "convert to speech", "voice narration", "read aloud".

View SKILL.md on GitHub Repository

Stars 37

Forks 1

Install this agent skill to your Project

npx add-skill https://github.com/marswaveai/skills/tree/main/tts

Metadata

Additional technical details for this skill

openclaw: { "emoji": "\ud83d\udd0a", "requires": { "env": [ "LISTENHUB_API_KEY" ] }, "primaryEnv": "LISTENHUB_API_KEY" }

SKILL.md

When to Use

User wants to convert text to spoken audio
User asks for "read aloud", "TTS", "text to speech", "voice narration"
User says "朗读", "配音", "语音合成"
User wants multi-speaker scripted audio or dialogue

When NOT to Use

User wants a podcast-style discussion with topic exploration (use /podcast)
User wants an explainer video with visuals (use /explainer)
User wants to generate an image (use /image-gen)

Purpose

Convert text into natural-sounding speech audio. Two paths:

Quick mode (/v1/tts): Single voice, low-latency, sync MP3 stream. For casual chat, reading snippets, instant audio.
Script mode (/v1/speech): Multi-speaker, per-segment voice assignment. For dialogue, audiobooks, scripted content.

Hard Constraints

No shell scripts. Construct curl commands from the API reference files listed in Resources
Always read shared/authentication.md for API key and headers
Follow shared/common-patterns.md for errors and interaction patterns
Never hardcode speaker IDs in API calls — use built-in defaults from shared/speaker-selection.md as fallback only; fetch from the speakers API when the user wants to change voice
Always read config following shared/config-pattern.md before any interaction
Always follow shared/speaker-selection.md for speaker selection (text table + free-text input)
Never save files to ~/Downloads/ or /tmp/ as primary output — save artifacts to the current working directory with friendly topic-based names (see shared/config-pattern.md § Artifact Naming)

Mode Detection

Determine the mode from the user's input automatically before asking any questions:

Signal	Mode
"多角色", "脚本", "对话", "script", "dialogue", "multi-speaker"	Script
Multiple characters mentioned by name or role	Script
Input contains structured segments (A: ..., B: ...)	Script
Single paragraph of text, no character markers	Quick
"读一下", "read this", "TTS", "朗读" with plain text	Quick
Ambiguous	Quick (default)

Interaction Flow

Step -1: API Key Check

Follow shared/config-pattern.md § API Key Check. If the key is missing, stop immediately.

Step 0: Config Setup

Follow shared/config-pattern.md Step 0 (Zero-Question Boot).

If file doesn't exist — silently create with defaults and proceed:

bash

mkdir -p ".listenhub/tts"
echo '{"outputMode":"inline","language":null,"defaultSpeakers":{}}' > ".listenhub/tts/config.json"
CONFIG_PATH=".listenhub/tts/config.json"
CONFIG=$(cat "$CONFIG_PATH")

Do NOT ask any setup questions. Proceed directly to the Interaction Flow.

If file exists — read config silently and proceed:

bash

CONFIG_PATH=".listenhub/tts/config.json"
[ ! -f "$CONFIG_PATH" ] && CONFIG_PATH="$HOME/.listenhub/tts/config.json"
CONFIG=$(cat "$CONFIG_PATH")

Setup Flow (user-initiated reconfigure only)

Only run when the user explicitly asks to reconfigure. Display current settings:

当前配置 (tts)：
  输出方式：{inline / download / both}
  语言偏好：{zh / en / 未设置}
  默认主播：{speakerName / 使用内置默认}

Then ask:

outputMode: Follow shared/output-mode.md § Setup Flow Question.
Language (optional): "默认语言？"
- "中文 (zh)"
- "English (en)"
- "每次手动选择" → keep null

After collecting answers, save immediately:

bash

NEW_CONFIG=$(echo "$CONFIG" | jq --arg m "$OUTPUT_MODE" '. + {"outputMode": $m}')
# Save language if user chose one (not "每次手动选择")
if [ "$LANGUAGE" != "null" ]; then
  NEW_CONFIG=$(echo "$NEW_CONFIG" | jq --arg lang "$LANGUAGE" '. + {"language": $lang}')
fi
echo "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")

Quick Mode — `POST /v1/tts`

Step 1: Extract text

Get the text to convert. If the user hasn't provided it, ask:

"What text would you like me to read aloud?"

Step 2: Determine voice

If config.defaultSpeakers.{language}[0] is set → use it silently (skip to Step 4)
If not set → use the built-in default from shared/speaker-selection.md for the detected language (skip to Step 4)
Only show speaker selection if the user explicitly asks to change voice

Step 3: Save preference

After the user explicitly selects a new voice (not when using defaults):

Question: "Save {voice name} as your default voice for {language}?"
Options:
  - "Yes" — update .listenhub/tts/config.json
  - "No" — use for this session only

Step 4: Confirm

Ready to generate:

  Text: "{first 80 chars}..."
  Voice: {voice name}

Proceed?

Step 5: Generate

bash

curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{"input": "...", "voice": "..."}' \
  --output /tmp/tts-output.mp3

Step 6: Present result

Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.

Use a timestamped jobId: $(date +%s)

inline or both (TTS quick returns a sync audio stream — no audioUrl):

bash

JOB_ID=$(date +%s)
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{"input": "...", "voice": "..."}' \
  --output /tmp/tts-${JOB_ID}.mp3

Then use the Read tool on /tmp/tts-{jobId}.mp3.

Present:

Audio generated!

download or both: Generate a topic slug from the text content following shared/config-pattern.md § Artifact Naming.

bash

SLUG="{topic-slug}"  # e.g. "server-maintenance-notice"
NAME="${SLUG}.mp3"
# Dedup: if file exists, append -2, -3, etc.
BASE="${NAME%.*}"; EXT="${NAME##*.}"; i=2
while [ -e "$NAME" ]; do NAME="${BASE}-${i}.${EXT}"; i=$((i+1)); done
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d '{"input": "...", "voice": "..."}' \
  --output "$NAME"

Present:

Audio generated!

已保存到当前目录：
  {NAME}

Script Mode — `POST /v1/speech`

Step 1: Get scripts

Determine whether the user already has a scripts array:

Already provided (JSON or clear segments): parse and display for confirmation
Not yet provided: help the user structure segments. Ask:

"Please provide the script with speaker assignments. Format: each line as SpeakerName: text content. I'll convert it."

Once the user provides the script, parse it into the scripts JSON format.

Step 2: Assign voices per character

For each unique character in the script:

If config.defaultSpeakers.{language} has saved voices → auto-assign silently (one per character in order)
If not set → use built-in defaults from shared/speaker-selection.md (Primary for first character, Secondary for second)
Only show speaker selection if the user explicitly asks to change voices

Step 3: Save preferences

After all voices are assigned (if any were new):

Question: "Save these voice assignments for future sessions?"
Options:
  - "Yes" — update defaultSpeakers in .listenhub/tts/config.json
  - "No" — use for this session only

Step 4: Confirm

Ready to generate:

  Characters:
    {name}: {voice}
    {name}: {voice}
  Segments: {count}
  Title: (auto-generated)

Proceed?

Step 5: Generate

Write the request body to a temp file, then submit:

bash

# Write request to temp file
cat > /tmp/lh-speech-request.json << 'ENDJSON'
{
  "scripts": [
    {"content": "...", "speakerId": "..."},
    {"content": "...", "speakerId": "..."}
  ]
}
ENDJSON

# Submit
curl -sS -X POST "https://api.marswave.ai/openapi/v1/speech" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Source: skills" \
  -d @/tmp/lh-speech-request.json

rm /tmp/lh-speech-request.json

Step 6: Present result

Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.

inline or both: Display the audioUrl and subtitlesUrl as clickable links.

Present:

Audio generated!

在线收听：{audioUrl}
字幕：{subtitlesUrl}
时长：{audioDuration / 1000}s
消耗积分：{credits}

download or both: Also download the file. Generate a topic slug following shared/config-pattern.md § Artifact Naming.

bash

SLUG="{topic-slug}"  # e.g. "welcome-dialogue"
NAME="${SLUG}.mp3"
# Dedup: if file exists, append -2, -3, etc.
BASE="${NAME%.*}"; EXT="${NAME##*.}"; i=2
while [ -e "$NAME" ]; do NAME="${BASE}-${i}.${EXT}"; i=$((i+1)); done
curl -sS -o "$NAME" "{audioUrl}"

Present:

已保存到当前目录：
  {NAME}

Updating Config

When saving preferences, merge into .listenhub/tts/config.json — do not overwrite unchanged keys.

Quick voice: set defaultSpeakers.{language}[0] to the selected speakerId
Script voices: set defaultSpeakers.{language} to the full array assigned this session
Language: set language if the user explicitly specifies it

API Reference

TTS & Speech endpoints: shared/api-tts.md
Speaker list: shared/api-speakers.md
Speaker selection guide: shared/speaker-selection.md
Error handling: shared/common-patterns.md § Error Handling
Long text input: shared/common-patterns.md § Long Text Input

Composability

Invokes: speakers API (for speaker selection)
Invoked by: explainer (for voiceover)

Examples

Quick mode:

"TTS this: The server will be down for maintenance at midnight."

Detect: Quick mode (plain text, "TTS this")
Read config: quickVoice is null
Fetch speakers, user picks "Yuanye"
Ask to save → yes → update config
POST /v1/tts with input + voice
Present: /tmp/tts-output.mp3

Script mode:

"帮我做一段双人对话配音，A说：欢迎大家，B说：谢谢邀请"

Detect: Script mode ("双人对话")
Parse segments: A → "欢迎大家", B → "谢谢邀请"
Read config: scriptVoices empty
Fetch zh speakers, assign A and B voices
Ask to save → yes → update config
POST /v1/speech with scripts array
Present: audioUrl, subtitlesUrl, duration

Maintainer

marswaveai Core maintainer

Source details

Full Name: marswaveai/skills
Branch: main
Path in repo: tts
License: MIT License
Topics: claude cursor skills aigc podcast slides videos

Featured Tools

Join Our Newsletter

DEPRECATED — replaced by individual skills. Use when the user triggers any ListenHub action: "make a podcast", "explainer video", "read aloud", "TTS", "generate image", "解说视频", "播客", "朗读", "生成图片".

37 1

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

Metadata

SKILL.md

When to Use

When NOT to Use

Purpose

Hard Constraints

Mode Detection

Interaction Flow

Step -1: API Key Check

Step 0: Config Setup

Setup Flow (user-initiated reconfigure only)

Quick Mode — POST /v1/tts

Script Mode — POST /v1/speech

Updating Config

API Reference

Composability

Examples

Recommended Agent Skills

explainer

content-parser

image-gen

podcast

creator

listenhub

Quick Mode — `POST /v1/tts`

Script Mode — `POST /v1/speech`