Agent skill

tts-horus

Build and operate the Horus TTS pipeline from cleared audiobooks. Includes dataset prep, WhisperX alignment, XTTS training, voice coloring, and persona inference helpers.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/tts-horus

Metadata

Additional technical details for this skill

short description: Horus TTS dataset + training + inference pipeline

SKILL.md

Horus TTS Pipeline

This skill standardizes the end-to-end Horus voice workflow using uvx-style invocations so dependencies are isolated per command. It assumes recordings live under persona/data/audiobooks/ with per-book clean/ folders.

Commands

`dataset`

Builds the Horus dataset from Horus Rising using faster-whisper segmentation.

```bash
# Step 1: Transcribe and segment
python run/tts/ingest_audiobook.py \
  --audio persona/data/audiobooks/Horus_Rising_The_Horus_Heresy_Book_1-LC_64_22050_stereo/audio.m4b \
  --book-name Horus_Rising \
  --output-dir datasets/horus_voice \
  --max-hours 0

# Step 2: Extract clips (CPU-bound, uses ffmpeg)
python run/tts/extract_clips.py \
  --segments datasets/horus_voice/segments_merged.jsonl \
  --audio persona/data/audiobooks/Horus_Rising_The_Horus_Heresy_Book_1-LC_64_22050_stereo/audio.m4b \
  --output-dir datasets/horus_voice/clips \
  --manifest datasets/horus_voice/full_manifest.jsonl \
  --min-sec 1.5 --max-sec 15.0 --sample-rate 24000 --max-hours 0
```
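The `segments_merged.jsonl` input and the MIN_TARGET=2.0s / MAX_GAP=0.5s values listed under Gotchas suggest that short segments are merged with their neighbours before extraction. The sketch below shows one way such a merge could work. The `Segment` fields, the `merge_short_segments` name, and the use of the `--max-sec` value as an upper bound are assumptions for illustration, not the actual logic in `ingest_audiobook.py`.

```python
from dataclasses import dataclass

MIN_TARGET = 2.0  # seconds; keep merging while the current clip is shorter than this
MAX_GAP = 0.5     # seconds; never merge across a pause longer than this
MAX_SEC = 15.0    # seconds; assumed cap, mirrors --max-sec in the extract step


@dataclass
class Segment:
    """Hypothetical segment record: start/end in seconds plus transcript text."""
    start: float
    end: float
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def merge_short_segments(segments):
    """Greedily merge each too-short segment with the segment that follows it."""
    merged = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.duration < MIN_TARGET          # previous clip is still too short
            and seg.start - prev.end <= MAX_GAP     # pause between them is small
            and seg.end - prev.start <= MAX_SEC     # merged clip stays under the cap
        ):
            merged[-1] = Segment(prev.start, seg.end, f"{prev.text} {seg.text}")
        else:
            merged.append(Segment(seg.start, seg.end, seg.text))
    return merged
```

A short segment that is followed by a long pause is left as-is here; a later duration filter (such as `--min-sec`) would then decide whether to keep it.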

`align`

Runs WhisperX alignment with lexicon overrides.

```bash
python run/tts/align_transcripts.py \
  --manifest datasets/horus_voice/train_manifest.jsonl \
  --output datasets/horus_voice/train_aligned.jsonl \
  --dataset-root datasets/horus_voice \
  --lexicon persona/docs/lexicon_overrides.json \
  --strategy whisperx --device cuda
```
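The format of `lexicon_overrides.json` is not described here. As a rough illustration only, the sketch below assumes it is a flat JSON object mapping a spelling as it tends to be transcribed to the spelling you want in the transcript. The `apply_lexicon` helper is hypothetical and may differ from what `align_transcripts.py` actually does.

```python
import re


def apply_lexicon(text, overrides):
    """Replace whole-word matches of each override key, ignoring case.

    `overrides` is assumed to be a dict such as {"horace": "Horus"}.
    """
    if not overrides:
        return text
    # Longest keys first so multi-word entries win over their sub-words.
    keys = sorted(overrides, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in overrides.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```

With this assumed format, the overrides would be loaded with `json.load` and applied to each manifest entry's text before alignment.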

`train`

Fine-tunes XTTS-v2 using GPTTrainer (local A5000 by default).

```bash
python run/tts/train_xtts_coqui.py --config configs/tts/horus_xtts.yaml
```

Important: `train_xtts_coqui.py` uses GPTTrainer, not the generic Trainer. Set `mixed_precision=False` to avoid NaN losses.

`say`

CLI synthesis (writes .artifacts/tts/output.wav by default).

```bash
python run/tts/say.py "Lupercal speaks."
```

`server`

FastAPI server for low-latency synthesis.

```bash
python run/tts/server.py
```

`color`

Voice coloring helper (to be implemented in Task 5).

```bash
python run/tts/color_voice.py --base horus --color warm --alpha 0.4
```
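Since the helper is not implemented yet, the meaning of `--alpha` is not defined here. One plausible reading is a linear blend between a base speaker embedding and a "color" embedding. The sketch below illustrates that reading only. `blend_embeddings` and the plain-list representation are hypothetical and do not describe the planned Task 5 design.

```python
def blend_embeddings(base, color, alpha):
    """Linearly interpolate two equal-length embedding vectors.

    alpha=0.0 returns the base voice unchanged; alpha=1.0 returns the color voice.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(base) != len(color):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - alpha) * b + alpha * c for b, c in zip(base, color)]
```

Under this assumption, `--alpha 0.4` would keep 60% of the base voice and mix in 40% of the color.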

Notes

- WhisperX is installed in the project `.venv`; for standalone runs, use `uvx whisperx` if needed.
- Golden samples live in `tests/fixtures/tts/golden/` (Git LFS).
- Orchestrate-ready task plan lives at `persona/docs/tasks/0N_voice_coloring.md`.

Gotchas

| Issue | Solution |
| --- | --- |
| NaN losses from step 0 | Use `mixed_precision=False`, `precision="float32"` |
| Model size mismatch (1026 vs 8194) | Use GPTTrainer with GPTArgs, not XttsConfig |
| Missing dvae.pth | Download from HuggingFace: `wget https://huggingface.co/coqui/XTTS-v2/resolve/main/dvae.pth` |
| Clips too short (rejected) | Merge adjacent segments: MIN_TARGET=2.0s, MAX_GAP=0.5s |
| Clip extraction slow | Normal: extraction uses ffmpeg (CPU-bound); the GPU is used only for training |
| Learning rate | Use 5e-6 (official recipe), NOT the default 2e-4 |
| Batch size | Keep `batch_size * grad_accumulation >= 252` for efficient training |
| LR milestones never reached | Script now auto-calculates them based on dataset size |
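The batch-size rule above is simple arithmetic, and the same numbers determine how many optimizer steps a run will take (which is what schedule milestones depend on). The sketch below shows that arithmetic. The function names are illustrative, and the step count assumes a partial final accumulation group still triggers an optimizer step, which may not match the trainer's behaviour.

```python
import math

TARGET_EFFECTIVE_BATCH = 252  # from the batch-size rule above


def grad_accum_steps(batch_size, target=TARGET_EFFECTIVE_BATCH):
    """Smallest accumulation count with batch_size * accum >= target."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(target / batch_size)


def optimizer_steps(num_clips, batch_size, accum, epochs):
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_clips / batch_size)
    return math.ceil(batches_per_epoch / accum) * epochs
```

For example, a per-device batch size of 4 needs 63 accumulation steps to reach an effective batch of 252.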

Learning Rate Schedule

The training script uses the following schedule and optimizer settings:

LR Schedule: CosineAnnealingWarmRestarts (recommended for XTTS)
- T_0: ~20% of total steps (first annealing cycle)
- T_mult: 2 (each subsequent cycle doubles in length)
- eta_min: 1e-7 (minimum LR floor)
- Gradient clipping: 1.0
- Optimizer: AdamW with weight decay 1e-2
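To make the schedule concrete, the following pure-Python sketch evaluates the cosine-annealing-with-warm-restarts rule, eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i)), at a given step. It mirrors the formula behind PyTorch's `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` but is a standalone illustration, not the code in `train_xtts_coqui.py`; the function name and the example values in the usage note are assumptions.

```python
import math


def cosine_warm_restart_lr(step, base_lr, t_0, t_mult=2, eta_min=1e-7):
    """Learning rate at `step` under cosine annealing with warm restarts.

    t_0 is the length of the first cycle in steps; each later cycle is
    t_mult times longer than the one before it.
    """
    if step < 0 or t_0 <= 0 or t_mult < 1:
        raise ValueError("step must be >= 0, t_0 > 0, t_mult >= 1")
    t_i = t_0      # length of the current cycle
    t_cur = step   # steps elapsed inside the current cycle
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cosine = 1.0 + math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (base_lr - eta_min) * cosine
```

With a hypothetical run of 5,000 steps, `t_0=1000` corresponds to the "~20% of total steps" setting: the rate starts at 5e-6, decays toward 1e-7 by step 999, and restarts at 5e-6 at steps 1000 and 3000.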

Why CosineAnnealingWarmRestarts over MultiStepLR:

- Smoother LR decay (better for fine-tuning)
- Periodic restarts help escape local minima
- Recommended by Coqui community for XTTS

Double descent is less relevant for fine-tuning because:

- The model is already trained: we are adapting it, not learning from scratch
- Single speaker: we want specialization to Horus's voice
- The risk is memorization, mitigated by diverse training data (~18k clips)

Automated Pipeline

The TTS pipeline can run fully automated from extraction to training:

```bash
# Start extraction, then auto-train when complete
./run/tts/auto_train_after_extraction.sh [extraction_pid]

# Or run the full pipeline from scratch
python run/tts/horus_pipeline.py --skip-align
```

The auto script will:

- Monitor extraction progress (logs every 60s)
- Create train/val manifests (98%/2% split)
- Build Coqui metadata.csv
- Start XTTS training (200 epochs)

Logs: logs/tts/pipeline_*.log
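The 98%/2% manifest split can be sketched as a seeded shuffle followed by a slice. This is an illustration under assumptions: `split_manifest`, the fixed seed, and the rule of holding out at least one validation entry are not taken from `auto_train_after_extraction.sh`.

```python
import random


def split_manifest(entries, val_fraction=0.02, seed=0):
    """Return (train, val) lists from a reproducible shuffle of `entries`."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # seeded so reruns give the same split
    # Hold out at least one entry whenever the manifest is non-empty.
    n_val = max(1, round(len(shuffled) * val_fraction)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]
```

Each entry would typically be one parsed line of `full_manifest.jsonl`; the two returned lists can then be written out as the train and validation manifests.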

Current Status (2026-01-23)

- Audiobook: Horus Rising (~12h)
- Extraction: In progress (~9,400 / ~21,975 clips)
- Auto-monitor: Running (PID in logs/tts/auto_pipeline.log)
- Config: configs/tts/horus_xtts.yaml (200 epochs, 5e-6 LR)
- Training: Will auto-start after extraction

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/tts-horus
License: MIT License
