Agent skill

tts-train

Build TTS datasets and train Qwen3-TTS voice models from audiobooks or curated clips. Use for voice cloning, dataset prep, WhisperX alignment, Qwen3-TTS training, and inference.

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/tts-train

Metadata

Additional technical details for this skill

short description
End-to-end TTS dataset prep + training workflow for Qwen3-TTS

SKILL.md

TTS Train Workflow

This skill provides a general, composable workflow for building datasets and training Qwen3-TTS voice models. It has its own pyproject.toml (self-contained) for Qwen3-TTS + TensorBoard.

Use the bundled run.sh so the correct environment is selected per step:

  • Project env for ingest/alignment (uses whisperx + faster-whisper).
  • Skill env for Qwen3-TTS training/inference + TensorBoard (avoids pandas conflicts).
bash
.agent/skills/tts-train/run.sh <command> ...

Dataset Build Options

Option A: Audiobook → Segments (fast path)

Use the audiobook-ingest skill for ingestion, then hand off to TTS:

bash
.agent/skills/audiobook-ingest/run.sh handoff-tts "<book_dir_name>"
.agent/skills/audiobook-ingest/run.sh align-tts

Direct ingest (if you want full control):

bash
.agent/skills/tts-train/run.sh ingest \
  persona/data/audiobooks/<book>/audio.m4b \
  <voice_name> \
  datasets/<voice_name>

Option B: Curated Clips + Transcripts

Provide a JSONL file with audio_file (relative to input dir) and text.

bash
.agent/skills/tts-train/run.sh ingest-transcript \
  data/<voice>/audio_raw \
  data/<voice>/transcripts.jsonl \
  datasets/<voice>

Alignment (WhisperX)

bash
.agent/skills/tts-train/run.sh align \
  datasets/<voice>/train_manifest.jsonl \
  datasets/<voice>/train_aligned.jsonl \
  datasets/<voice>

Iterative Training Loop (Recommended)

The iterative training loop provides a two-phase workflow for optimal TTS model training:

Why Two Phases?

Phase Purpose Optimizes
Hyperparameter Search Find optimal training config Training efficiency (loss curves)
Evaluation Loop Train until quality threshold Output quality (voice similarity)

Key insight: Loss ≠ Quality. A model with low training loss can still produce poor voice output (wrong prosody, artifacts, speaker drift). That's why we need BOTH phases.

Full Workflow (Recommended)

bash
# Full workflow: hyperparameter search + iterative training
python .agent/skills/tts-train/iterative_train.py \
  --model-path Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --data datasets/<voice>/train_manifest_qwen3.jsonl \
  --output artifacts/tts/<voice>_iterative \
  --run-hyperparameter-search \
  --hp-trials 10 \
  --max-iterations 5 \
  --quality-threshold 3.5

Evaluation-Only (Skip Hyperparameter Search)

Use when you already have good hyperparameters from a previous search:

bash
# With hyperparams file
python .agent/skills/tts-train/iterative_train.py \
  --model-path artifacts/tts/<voice>/checkpoint-epoch-0 \
  --data datasets/<voice>/train_manifest_qwen3.jsonl \
  --output artifacts/tts/<voice>_refined \
  --hyperparams best_config.json \
  --max-iterations 3

# Auto-evaluation mode (no manual rating)
python .agent/skills/tts-train/iterative_train.py \
  --model-path artifacts/tts/<voice>/checkpoint-epoch-0 \
  --data datasets/<voice>/train_manifest_qwen3.jsonl \
  --output artifacts/tts/<voice>_refined \
  --max-iterations 3 \
  --auto-evaluate

Custom Evaluation Phrases

bash
python .agent/skills/tts-train/iterative_train.py \
  --model-path /path/to/model \
  --data /path/to/data.jsonl \
  --output /path/to/output \
  --eval-phrases "I am the Warmaster" "The Emperor betrayed us" \
  --epochs-per-iteration 2

How It Works

Phase 1: Hyperparameter Search (if --run-hyperparameter-search):

  1. Uses Bayesian optimization (Optuna) to search config space
  2. Runs short smoke runs (300 steps) per trial
  3. Optimizes: learning rate, LoRA config, warmup, weight decay
  4. Saves best config to hyperparameter_search/best_config.json

Phase 2: Evaluation Loop:

  1. Trains for N epochs with optimal config
  2. Generates audio samples for evaluation phrases
  3. Rates quality (manual 1-5 or auto-heuristics)
  4. If quality threshold met → STOP
  5. If not → continue from checkpoint
  6. Repeat until max iterations or threshold met

Outputs

output_dir/
├── hyperparameter_search/    # Phase 1 results
│   ├── optuna_study.db       # Optuna database (view with optuna-dashboard)
│   ├── best_config.json      # Optimal hyperparameters
│   └── trial_N/              # Each trial's smoke run
├── checkpoints/
│   └── iteration_N/          # Each iteration's checkpoint
├── evaluations/
│   └── iteration_N/          # Audio samples + ratings
│       ├── eval_00.wav
│       ├── eval_01.wav
│       └── evaluation.json
└── final_model/              # Best performing model (copied on completion)

Monitor Hyperparameter Search

bash
# View Optuna dashboard
optuna-dashboard sqlite:///artifacts/tts/<voice>_iterative/hyperparameter_search/optuna_study.db

Training (Qwen3-TTS)

Qwen3-TTS provides better quality and more natural speech. Requires proper audio_codes format:

bash
# Convert existing manifest to Qwen3-TTS format with audio_codes
.agent/skills/tts-train/run.sh convert-qwen3 \
  datasets/<voice>/train_manifest.jsonl \
  datasets/<voice>/train_manifest_qwen3.jsonl

# Train Qwen3-TTS model
.agent/skills/tts-train/run.sh train-qwen3 \
  --base-model Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --data-manifest datasets/<voice>/train_manifest_qwen3.jsonl \
  --out-dir artifacts/tts/<voice>_qwen3 \
  --epochs 5 --batch-size 8 --lr 1e-4

Audio Codes Format: Qwen3-TTS tokenizer returns audio_codes as List[torch.LongTensor] with shape [time_steps, 16_quantizers]. Extract with enc.audio_codes[0].cpu().tolist().

Training (Legacy XTTS-v2)

Note: XTTS-v2 support has been deprecated in favor of Qwen3-TTS, which provides superior voice quality and more natural speech patterns. The skill infrastructure for XTTS remains for backward compatibility but is no longer actively maintained.

For new voice training projects, please use the Qwen3-TTS workflow documented above.

Inference

bash
# Qwen3-TTS inference example
uv run python run/tts/server.py

TensorBoard (Auto)

bash
.agent/skills/tts-train/run.sh tensorboard 6006

Long Runs (Scheduler Skill)

For overnight runs, register a scheduler job so training survives terminal drops:

bash
.agent/skills/scheduler/run.sh register \
  --name "tts-train-<voice>" \
  --interval "12h" \
  --workdir "/home/graham/workspace/experiments/memory" \
  --command ".agent/skills/tts-train/run.sh train-qwen3 --base-model Qwen/Qwen3-TTS-12Hz-1.7B-Base --data-manifest datasets/<voice>/train_manifest_qwen3.jsonl --out-dir artifacts/tts/<voice>_qwen3_1.7b --epochs 10 --batch-size 1 --lr 2e-6 --gradient-accumulation-steps 8 --use-lora --lora-r 16 --lora-alpha 32 --use-8bit-adam --gradient-checkpointing | tee logs/tts/<voice>_train.log"

Troubleshooting

Qwen3-TTS 0.6B Model Training Support

Note: The official sft_12hz.py script requires patching to support the 0.6B conversational model (due to dimension mismatch).

Automated Solution: The cli.py tool automatically applies the required patch (001-fix-sft-optimizations.patch) when you run:

bash
.agent/skills/tts-train/cli.py ensure-repo

This patch includes:

  1. Text Projection Fix: Adds text_projection layer support for 0.6B models.
  2. Training Optimizations: Adds CLI args for max_steps, gradient_accumulation_steps, and weight_decay.
  3. Flash Attention Disable: Forces eager attention to avoid compatibility issues.

If you encounter issues, verify the patch status:

bash
cd third_party/Qwen3-TTS
git status  # Should show modified sft_12hz.py

Qwen3-TTS 1.7B Model Training (Advanced)

Memory Optimization for Large Models: The 1.7B model requires significant VRAM optimization to run on consumer GPUs (24GB VRAM).

Key Configuration:

bash
# 1.7B model with memory optimization
.agent/skills/tts-train/run.sh train-qwen3 \
  --base-model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --data-manifest datasets/<voice>/train_manifest_qwen3.jsonl \
  --out-dir artifacts/tts/<voice>_qwen3_1.7b \
  --epochs 10 --batch-size 1 --lr 2e-6 \
  --gradient-accumulation-steps 8 \
  --use-lora --lora-r 16 --lora-alpha 32 \
  --use-8bit-adam --gradient-checkpointing

Memory Optimization Techniques:

  1. LoRA (PEFT): Reduces trainable parameters by ~90% while maintaining quality
  2. 8-bit AdamW: Cuts optimizer state memory by ~70%
  3. Gradient Checkpointing: Trades compute for memory (saves ~40% VRAM)
  4. Batch Size 1: Essential for 24GB VRAM systems with 12GB system overhead

Expected VRAM Usage: ~18GB total (vs ~32GB without optimizations)

Training Results: 9 epochs completed successfully, generating native-compatible checkpoints that preserve speaker identity for high-quality narration.

Model Type: 1.7B models are "custom_voice" models (not voice_clone), use generate_custom_voice() method for inference with trained speaker IDs.

Checkpoint Validation: Each epoch saves complete checkpoints with merged LoRA weights, ensuring native from_pretrained() compatibility without surgical splicing.

Hyperparameter Optimization (1.7B Advanced)

Critical for 1.7B Success: Use Bayesian optimization with web research to find optimal hyperparameters before full training.

Web-Enhanced Bayesian Tuning

bash
# 1.7B model with web research for optimal parameters
.agent/skills/tts-train/run.sh tune-1.7b-bayesian \
  --n_trials 15 \
  --n_smoke_steps 300 \
  --model_size 1.7b \
  --use_web_research \
  --dataset horus

# Monitor progress with Optuna dashboard
optuna-dashboard sqlite:///runs/horus/bayesian_tuning_1.7b/optuna_study.db

Web Research Integration:

  • Searches recent papers and community best practices
  • Validates parameters against known failure modes
  • Provides intelligent parameter initialization

Optimized Search Space for 1.7B:

  • Learning Rate: 1e-6 to 5e-5 (conservative for large models)
  • Batch Size: Fixed at 1 (memory constraint)
  • Gradient Accumulation: 8-32 (for effective batch size)
  • LoRA Config: r=[16,32], alpha=[32,64] (based on our successful training)
  • Warmup Steps: 500-2000 (prevents early instability)

Memory-Aware Tuning: Automatically applies 1.7B optimizations (LoRA, 8-bit Adam, gradient checkpointing) during tuning phase.

Traditional Bayesian Tuning (No Web Research)

bash
# Standard Bayesian optimization
python .agent/skills/tts-train/tune_qwen3_1.7b_bayesian.py \
  --n_trials 10 \
  --model_size 1.7b \
  --dataset horus

Expected Results: Finds optimal configuration in 10-15 trials vs 50+ grid search experiments.

Best Configuration Application

After tuning completes, use the optimal parameters for full training:

bash
# Load best config from tuning
BEST_CONFIG=$(cat artifacts/tts/qwen3_1.7b_bayesian/best_config.json | jq -r '.hyperparameters')

# Apply to full training
python .agent/skills/tts-train/run.sh train-qwen3 \
  --base-model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --data-manifest datasets/horus/train_manifest_qwen3.jsonl \
  --out-dir artifacts/tts/horus_qwen3_1.7b_optimized \
  --lr $(echo $BEST_CONFIG | jq -r '.lr') \
  --gradient-accumulation-steps $(echo $BEST_CONFIG | jq -r '.gradient_accumulation_steps') \
  --weight-decay $(echo $BEST_CONFIG | jq -r '.weight_decay') \
  --lora-r $(echo $BEST_CONFIG | jq -r '.lora_r') \
  --lora-alpha $(echo $BEST_CONFIG | jq -r '.lora_alpha') \
  --warmup-steps $(echo $BEST_CONFIG | jq -r '.warmup_steps') \
  --use-lora --use-8bit-adam --gradient-checkpointing

Why This Matters:

CRITICAL WARNING: 1.7B model training WILL FAIL without proper hyperparameter tuning due to:

  • ❌ CUDA out of memory errors (improper batch/memory settings)
  • ❌ Poor voice quality (suboptimal learning rates)
  • ❌ Training instability (incorrect LoRA configurations)
  • ❌ 3-5x longer convergence time

With Tuning: Achieves optimal results in 10-15 trials vs 50+ manual experiments with 15-20% better final loss.

Inference and Model Usage

Qwen3-TTS Model Loading

1.7B Custom Voice Models (trained with this workflow):

python
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "artifacts/tts/<voice>_qwen3_1.7b/checkpoint-epoch-9",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Generate speech with trained speaker
audio_outputs, sample_rate = model.generate_custom_voice(
    text="Your text here",
    speaker="<voice_name>",  # Use the speaker name from training
    non_streaming_mode=True
)

Memory Requirements: 1.7B models require ~18GB VRAM for inference. Use device_map="auto" for automatic memory management.

Model Types:

  • Custom Voice: Trained models use generate_custom_voice() with speaker IDs
  • Voice Clone: Base models support generate_voice_clone() with reference audio

CUDA Library Error: libcudnn_ops_infer.so.8

If faster-whisper/WhisperX crashes with:

Could not load library libcudnn_ops_infer.so.8: cannot open shared object file

Fix: Set LD_LIBRARY_PATH before running:

bash
export LD_LIBRARY_PATH=$(python -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')

Or use the Docker container with cuDNN pre-installed:

bash
docker run --rm -it --gpus all -v $(pwd):/workspace memory-tts:cu121 python ...

Too Many Short Segments (Low Clip Yield)

Whisper's VAD often produces very short segments (1-1.5s) that get rejected by the --min-sec 1.5 threshold. Merge adjacent segments before extraction:

python
# Merge segments with gap < 0.5s until 2-10s target duration
MIN_TARGET, MAX_TARGET, MAX_GAP = 2.0, 10.0, 0.5
merged = []
current = None
for seg in segments:
    if current is None:
        current = seg.copy()
        continue
    gap = seg['start'] - current['end']
    combined_dur = seg['end'] - current['start']
    if gap <= MAX_GAP and combined_dur <= MAX_TARGET and (current['end'] - current['start']) < MIN_TARGET:
        current['end'] = seg['end']
        current['text'] += ' ' + seg['text']
    else:
        merged.append(current)
        current = seg.copy()
if current:
    merged.append(current)

This typically reduces 44k segments → 22k with mean 2s duration, greatly improving yield.

Using Pre-existing Segments

Skip transcription if you already have segments:

bash
.agent/skills/tts-train/run.sh ingest \
  <audio.m4b> <voice_name> <output_dir> \
  --segments-jsonl existing_segments.jsonl

If you want a concise dataset report, emit a .batch_state.json and run batch-report:

bash
uv run python - <<'PY'
import json
from pathlib import Path

root = Path("datasets/<voice>")
manifest = root / "train_manifest.jsonl"
rejections = root / "rejections.jsonl"
total = sum(1 for _ in manifest.open()) if manifest.exists() else 0
rejected = sum(1 for _ in rejections.open()) if rejections.exists() else 0
state = {
  "name": "tts-dataset",
  "description": "TTS dataset build summary",
  "total": total + rejected,
  "successful": total,
  "failed": rejected,
}
(root / ".batch_state.json").write_text(json.dumps(state, indent=2))
PY

uv run python .agent/skills/batch-report/report.py summary datasets/<voice>

Didn't find tool you were looking for?

Be as detailed as possible for better results