Agent skill
tts-horus
Build and operate the Horus TTS pipeline from cleared audiobooks. Includes dataset prep, WhisperX alignment, XTTS training, voice coloring, and persona inference helpers.
Install this agent skill in your project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/tts-horus
Metadata
- short description: Horus TTS dataset + training + inference pipeline
SKILL.md
Horus TTS Pipeline
This skill standardizes the end-to-end Horus voice workflow using uvx-style
invocations so dependencies are isolated per command. It assumes recordings
live under persona/data/audiobooks/ with per-book clean/ folders.
Commands
dataset
Builds the Horus dataset from Horus Rising using faster-whisper segmentation.
# Step 1: Transcribe and segment
python run/tts/ingest_audiobook.py \
--audio persona/data/audiobooks/Horus_Rising_The_Horus_Heresy_Book_1-LC_64_22050_stereo/audio.m4b \
--book-name Horus_Rising \
--output-dir datasets/horus_voice \
--max-hours 0
# Step 2: Extract clips (CPU-bound, uses ffmpeg)
python run/tts/extract_clips.py \
--segments datasets/horus_voice/segments_merged.jsonl \
--audio persona/data/audiobooks/Horus_Rising_The_Horus_Heresy_Book_1-LC_64_22050_stereo/audio.m4b \
--output-dir datasets/horus_voice/clips \
--manifest datasets/horus_voice/full_manifest.jsonl \
--min-sec 1.5 --max-sec 15.0 --sample-rate 24000 --max-hours 0
align
Runs WhisperX alignment with lexicon overrides.
python run/tts/align_transcripts.py \
--manifest datasets/horus_voice/train_manifest.jsonl \
--output datasets/horus_voice/train_aligned.jsonl \
--dataset-root datasets/horus_voice \
--lexicon persona/docs/lexicon_overrides.json \
--strategy whisperx --device cuda
train
Fine-tunes XTTS-v2 using GPTTrainer (local A5000 by default).
python run/tts/train_xtts_coqui.py --config configs/tts/horus_xtts.yaml
Important: train_xtts_coqui.py uses GPTTrainer, not the generic Trainer.
Set mixed_precision=False to avoid NaN losses.
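The training constraints mentioned in this skill (no mixed precision, a 5e-6 learning rate, and an effective batch size of at least 252) could be checked before a long run starts. The following is a minimal sketch, not part of the pipeline: `check_training_config` and the key names `mixed_precision`, `batch_size`, `grad_accumulation`, and `lr` are assumptions and may not match the actual schema of configs/tts/horus_xtts.yaml.

```python
import math

# Hypothetical pre-flight check. The key names are assumptions and may not
# match the real schema of configs/tts/horus_xtts.yaml.
MIN_EFFECTIVE_BATCH = 252  # floor for batch_size * grad_accumulation
RECOMMENDED_LR = 5e-6      # learning rate recommended in this skill


def check_training_config(cfg: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if cfg.get("mixed_precision", False):
        problems.append("mixed_precision should be False to avoid NaN losses")
    effective = cfg.get("batch_size", 0) * cfg.get("grad_accumulation", 1)
    if effective < MIN_EFFECTIVE_BATCH:
        problems.append(
            f"effective batch size {effective} is below {MIN_EFFECTIVE_BATCH}"
        )
    if not math.isclose(cfg.get("lr", 0.0), RECOMMENDED_LR):
        problems.append(f"lr differs from the recommended {RECOMMENDED_LR}")
    return problems
```

Called with a dict loaded from the YAML config, an empty result suggests the run matches the settings described here; it does not validate anything else about the config.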
say
CLI synthesis (writes .artifacts/tts/output.wav by default).
python run/tts/say.py "Lupercal speaks."
server
FastAPI server for low-latency synthesis.
python run/tts/server.py
color
Voice coloring helper (to be implemented in Task 5).
python run/tts/color_voice.py --base horus --color warm --alpha 0.4
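Since the coloring helper is not implemented yet, the meaning of --alpha is not defined here. One plausible reading, shown purely as an assumption, is a linear blend between a base speaker embedding and a "color" embedding. The function `blend_embeddings` and the use of plain float lists are hypothetical and are not taken from color_voice.py.

```python
def blend_embeddings(base: list[float], color: list[float], alpha: float) -> list[float]:
    """Linearly interpolate two embeddings (an assumed interpretation of --alpha).

    alpha=0.0 returns the base embedding unchanged; alpha=1.0 returns the color.
    """
    if len(base) != len(color):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [(1.0 - alpha) * b + alpha * c for b, c in zip(base, color)]
```

Under this reading, --alpha 0.4 would keep 60% of the base voice and take 40% from the color voice.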
Notes
- WhisperX is installed in the project .venv; for standalone runs, use uvx whisperx if needed.
- Golden samples live in tests/fixtures/tts/golden/ (Git LFS).
- Orchestrate-ready task plan lives at persona/docs/tasks/0N_voice_coloring.md.
Gotchas
| Issue | Solution |
|---|---|
| NaN losses from step 0 | Use mixed_precision=False, precision="float32" |
| Model size mismatch (1026 vs 8194) | Use GPTTrainer with GPTArgs, not XttsConfig |
| Missing dvae.pth | Download from HuggingFace: wget https://huggingface.co/coqui/XTTS-v2/resolve/main/dvae.pth |
| Clips too short (rejected) | Merge adjacent segments: MIN_TARGET=2.0s, MAX_GAP=0.5s |
| Clip extraction slow | Normal - uses ffmpeg (CPU-bound), GPU only for training |
| Learning rate | Use 5e-6 (official recipe), NOT default 2e-4 |
| Batch size | batch_size * grad_accumulation >= 252 for efficient training |
| LR milestones never reached | Script now auto-calculates based on dataset size |
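The "Clips too short" fix in the table can be illustrated with a small merging pass. This is a sketch under assumptions, not the logic used by ingest_audiobook.py: the function name `merge_short_segments` and the `start`, `end`, and `text` field names are hypothetical, and only the MIN_TARGET and MAX_GAP values come from the table.

```python
MIN_TARGET = 2.0  # seconds; keep merging while the current clip is shorter than this
MAX_GAP = 0.5     # seconds; only merge across pauses no longer than this


def merge_short_segments(segments):
    """Merge adjacent segments so short ones grow toward MIN_TARGET.

    `segments` is a list of dicts with "start" and "end" (seconds) and "text",
    sorted by start time. These field names are assumptions.
    """
    merged = []
    for seg in segments:
        if merged:
            prev = merged[-1]
            prev_len = prev["end"] - prev["start"]
            gap = seg["start"] - prev["end"]
            if prev_len < MIN_TARGET and gap <= MAX_GAP:
                # Extend the previous clip instead of starting a new one.
                prev["end"] = seg["end"]
                prev["text"] = f'{prev["text"]} {seg["text"]}'.strip()
                continue
        merged.append(dict(seg))  # copy so the input list is not mutated
    return merged
```

The sketch does not cap merged clips at a maximum length, so a real implementation would also need to respect the --max-sec limit used during extraction.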
Learning Rate Schedule
The training script uses the following schedule and optimizer settings:
LR Schedule: CosineAnnealingWarmRestarts (recommended for XTTS)
- T_0: ~20% of total steps (first annealing cycle)
- T_mult: 2 (each subsequent cycle doubles in length)
- eta_min: 1e-7 (minimum LR floor)
- Gradient clipping: 1.0
- Optimizer: AdamW with weight decay 1e-2
Why CosineAnnealingWarmRestarts over MultiStepLR:
- Smoother LR decay (better for fine-tuning)
- Periodic restarts help escape local minima
- Recommended by Coqui community for XTTS
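To make the schedule concrete, the learning rate at a given step can be computed directly from the standard cosine-annealing-with-warm-restarts formula. This is a dependency-free sketch for illustration, not the training script's code: `cosine_warm_restart_lr` is a hypothetical helper, and the default `t_0=1000` is a placeholder since the real value depends on the total step count.

```python
import math


def cosine_warm_restart_lr(step, base_lr=5e-6, eta_min=1e-7, t_0=1000, t_mult=2):
    """Learning rate at `step` for cosine annealing with warm restarts.

    t_0 is the length of the first cycle in steps (a placeholder default here);
    each later cycle is t_mult times longer than the one before it.
    """
    t_i = t_0      # length of the current cycle
    t_cur = step   # position inside the current cycle
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cosine = 1 + math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (base_lr - eta_min) * cosine
```

With T_0 at roughly 20% of total steps, one might set `t_0 = max(1, int(0.2 * total_steps))`, where `total_steps` is whatever the run computes from dataset size, batch size, and epochs. The rate starts at `base_lr`, decays toward `eta_min` within each cycle, and jumps back to `base_lr` at every restart.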
Double descent is less relevant for fine-tuning because:
- Model already trained - we're adapting, not learning from scratch
- Single speaker - we want specialization to Horus's voice
- Risk is memorization, mitigated by diverse training data (~18k clips)
Automated Pipeline
The TTS pipeline can run fully automated from extraction to training:
# Start extraction, then auto-train when complete
./run/tts/auto_train_after_extraction.sh [extraction_pid]
# Or run the full pipeline from scratch
python run/tts/horus_pipeline.py --skip-align
The auto script will:
- Monitor extraction progress (logs every 60s)
- Create train/val manifests (98%/2% split)
- Build Coqui metadata.csv
- Start XTTS training (200 epochs)
Logs: logs/tts/pipeline_*.log
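The 98%/2% train/val split can be sketched as a deterministic shuffle followed by a slice. This is an illustration under assumptions, not the code in auto_train_after_extraction.sh: `split_manifest` is a hypothetical helper, and the fixed seed and the rule of keeping at least one validation row are choices made for this example.

```python
import random


def split_manifest(rows, val_fraction=0.02, seed=0):
    """Shuffle rows reproducibly and return (train_rows, val_rows).

    `rows` can be any iterable of manifest entries, for example dicts parsed
    from a JSONL manifest. The seed and the minimum of one validation row
    are assumptions of this sketch.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = max(1, round(len(rows) * val_fraction)) if rows else 0
    return rows[n_val:], rows[:n_val]
```

A fixed seed keeps the validation set stable across reruns, which makes losses from different training runs comparable.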
Current Status (2026-01-23)
- Audiobook: Horus Rising (~12h)
- Extraction: In progress (~9,400 / ~21,975 clips)
- Auto-monitor: Running (PID in logs/tts/auto_pipeline.log)
- Config: configs/tts/horus_xtts.yaml (200 epochs, 5e-6 LR)
- Training: Will auto-start after extraction