Agent skill
livekit-stt-selfhosted
Build self-hosted speech-to-text APIs using Hugging Face models (Whisper, Wav2Vec2) and create LiveKit voice agent plugins. Use when building STT infrastructure, creating custom LiveKit plugins, deploying self-hosted transcription services, or integrating Whisper/HF models with LiveKit agents. Includes FastAPI server templates, LiveKit plugin implementation, model selection guides, and production deployment patterns.
Install this agent skill to your Project
npx add-skill https://github.com/Okeysir198/P20251122-claude-skills/tree/main/.claude/skills/livekit-stt-selfhosted
LiveKit Self-Hosted STT Plugin
Build self-hosted speech-to-text APIs and LiveKit voice agent plugins using Hugging Face models.
Overview
This skill provides templates and guidance for:
- Building a self-hosted STT API server using FastAPI + Whisper/HF models
- Creating a LiveKit plugin that connects to your self-hosted API
- Deploying and scaling in production
Quick Start
Option 1: Build Both (API + Plugin)
When user wants complete setup:
- Create API Server:
python scripts/setup_api_server.py my-stt-server --model openai/whisper-medium
cd my-stt-server
pip install -r requirements.txt
python main.py
- Create Plugin:
python scripts/setup_plugin.py custom-stt
cd livekit-plugins-custom-stt
pip install -e .
- Use in LiveKit Agent:
from livekit.plugins import custom_stt
stt=custom_stt.STT(api_url="ws://localhost:8000/ws/transcribe")
Option 2: API Server Only
When user only needs the API server:
- Use scripts/setup_api_server.py with desired model
- See references/api_server_guide.md for implementation details
- Template in assets/api-server/
Option 3: Plugin Only
When user has existing API and needs LiveKit plugin:
- Use scripts/setup_plugin.py with plugin name
- See references/plugin_implementation.md for details
- Template in assets/plugin-template/
Model Selection
Help user choose the right model:
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Best accuracy | openai/whisper-large-v3 | SOTA quality, requires GPU |
| Production balance | openai/whisper-medium | Good quality, reasonable speed |
| Real-time/fast | openai/whisper-small | Fast, acceptable quality |
| CPU-only | openai/whisper-tiny | Can run without GPU |
| English-only | facebook/wav2vec2-large-960h | Optimized for English |
For detailed comparison and optimization tips, see references/models_comparison.md.
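The table can be kept next to the setup scripts as a small lookup. The following is only a sketch: the model IDs come from the table above, while the use-case keys and the pick_model helper are hypothetical names introduced for illustration and are not part of the skill's scripts.

```python
# Model IDs copied from the table above; the dictionary keys are hypothetical labels.
RECOMMENDED_MODELS = {
    "best_accuracy": "openai/whisper-large-v3",
    "production": "openai/whisper-medium",
    "realtime": "openai/whisper-small",
    "cpu_only": "openai/whisper-tiny",
    "english_only": "facebook/wav2vec2-large-960h",
}


def pick_model(use_case: str) -> str:
    """Return the recommended model ID for a use-case label (hypothetical helper)."""
    try:
        return RECOMMENDED_MODELS[use_case]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED_MODELS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}") from None
```

The returned ID could then be passed to scripts/setup_api_server.py through its --model option, as in the Quick Start.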
Implementation Workflow
Building the API Server
- Use the template: Start with assets/api-server/main.py
- Key components:
  - FastAPI app with WebSocket endpoint
  - Model loading at startup (kept in memory)
  - Audio buffer management
  - WebSocket protocol for streaming
- Customization points:
  - Model selection (change MODEL_ID in .env)
  - Audio processing parameters
  - Batch size and optimization
  - Error handling
For complete implementation guide, see references/api_server_guide.md.
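To make the "audio buffer management" component concrete, here is a minimal sketch of a buffer that collects incoming PCM frames and releases fixed-length chunks for transcription. It assumes the 16 kHz, 16-bit mono format from the WebSocket protocol and the 1-second chunk size suggested under Best Practices; the AudioBuffer class and its method names are hypothetical and may not match the template in assets/api-server/main.py.

```python
SAMPLE_RATE = 16_000   # Hz, as stated in the WebSocket protocol section
BYTES_PER_SAMPLE = 2   # 16-bit PCM


class AudioBuffer:
    """Accumulate raw PCM bytes and hand back fixed-duration chunks (hypothetical)."""

    def __init__(self, chunk_seconds: float = 1.0) -> None:
        self._chunk_bytes = int(SAMPLE_RATE * chunk_seconds) * BYTES_PER_SAMPLE
        self._data = bytearray()

    def push(self, frame: bytes) -> list[bytes]:
        """Append one binary frame and return every complete chunk now available."""
        self._data.extend(frame)
        chunks = []
        while len(self._data) >= self._chunk_bytes:
            chunks.append(bytes(self._data[: self._chunk_bytes]))
            del self._data[: self._chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return any leftover audio (for example on an "end" message) and reset."""
        rest = bytes(self._data)
        self._data.clear()
        return rest
```

In a WebSocket handler, each binary message would go through push(), each returned chunk would be sent to the model, and flush() would be called once the client signals the end of the stream.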
Building the LiveKit Plugin
- Use the template: Start with assets/plugin-template/
- Required implementations:
  - _recognize_impl() - Non-streaming recognition
  - stream() - Return SpeechStream instance
  - SpeechStream class - Handle streaming
- Key considerations:
  - Audio format conversion (16kHz, mono, 16-bit PCM)
  - WebSocket connection management
  - Event emission (interim/final transcripts)
  - Error handling and cleanup
For complete implementation guide, see references/plugin_implementation.md.
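The audio format conversion listed under key considerations can be illustrated with the standard library alone. This is a rough sketch, not the plugin template's code: the function name to_mono_16k is hypothetical, input is assumed to be interleaved 16-bit PCM in native byte order, and the resampler uses plain linear interpolation with no anti-aliasing filter, so a production plugin would likely use a proper resampling library instead.

```python
from array import array

TARGET_RATE = 16_000  # Hz, the rate the API server expects


def to_mono_16k(pcm: bytes, sample_rate: int, channels: int) -> bytes:
    """Convert interleaved 16-bit PCM to 16 kHz mono 16-bit PCM (hypothetical helper)."""
    samples = array("h")
    samples.frombytes(pcm)

    # Downmix: average the channels of each frame.
    frames = len(samples) // channels
    mono = [
        sum(samples[i * channels:(i + 1) * channels]) // channels
        for i in range(frames)
    ]
    if sample_rate == TARGET_RATE or not mono:
        return array("h", mono).tobytes()

    # Resample by linear interpolation between neighbouring input samples.
    out_len = int(len(mono) * TARGET_RATE / sample_rate)
    step = sample_rate / TARGET_RATE
    out = array("h")
    for n in range(out_len):
        pos = n * step
        i = int(pos)
        frac = pos - i
        nxt = mono[min(i + 1, len(mono) - 1)]
        out.append(int(round(mono[i] * (1 - frac) + nxt * frac)))
    return out.tobytes()
```

A 48 kHz stereo frame would be passed as to_mono_16k(data, 48000, 2) before being sent to the server as a binary message.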
Deployment
Development
# API Server
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
# Test WebSocket
ws://localhost:8000/ws/transcribe
Production
Docker (Recommended):
docker-compose up
Kubernetes: Use manifests in deployment guide
Cloud Platforms: AWS ECS, GCP Cloud Run, Azure Container Instances
For complete deployment guide including scaling, monitoring, and security, see references/deployment.md.
WebSocket Protocol
Client → Server
- Audio: Binary (16-bit PCM, 16kHz)
- Config: {"type": "config", "language": "en"}
- End: {"type": "end"}
Server → Client
- Interim: {"type": "interim", "text": "..."}
- Final: {"type": "final", "text": "...", "language": "en"}
- Error: {"type": "error", "message": "..."}
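The JSON messages above can be built and parsed with a few small helpers. The sketch below covers only the message shapes listed in this section; the function names and the Transcript dataclass are hypothetical, and raising an exception on an error message is one possible design choice rather than what the plugin template necessarily does.

```python
import json
from dataclasses import dataclass
from typing import Optional


def config_message(language: str = "en") -> str:
    """Client -> server: select the transcription language."""
    return json.dumps({"type": "config", "language": language})


def end_message() -> str:
    """Client -> server: signal that no more audio will follow."""
    return json.dumps({"type": "end"})


@dataclass
class Transcript:
    text: str
    is_final: bool
    language: Optional[str] = None


def parse_server_message(raw: str) -> Transcript:
    """Server -> client: turn an interim/final message into a Transcript."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "interim":
        return Transcript(text=msg["text"], is_final=False)
    if kind == "final":
        return Transcript(text=msg["text"], is_final=True, language=msg.get("language"))
    if kind == "error":
        raise RuntimeError(msg.get("message", "unknown server error"))
    raise ValueError(f"unexpected message type: {kind!r}")
```

Audio itself is not wrapped in JSON: per the protocol, it travels as binary frames alongside these text messages.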
Common Tasks
Change Model
Edit .env:
MODEL_ID=openai/whisper-small # Faster model
Add Language Support
In plugin usage:
stt=custom_stt.STT(language="es") # Spanish
stt=custom_stt.STT(detect_language=True) # Auto-detect
Enable GPU
In API server:
DEVICE=cuda:0 # Use GPU
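On the server side, MODEL_ID and DEVICE might be read along these lines. This is a sketch under assumptions: it expects the values from .env to have already been exported into the process environment, the Settings class and load_settings function are hypothetical, and the fallback defaults (whisper-medium, cpu) are illustrative choices rather than the template's documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_id: str
    device: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read MODEL_ID and DEVICE, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return Settings(
        model_id=env.get("MODEL_ID", "openai/whisper-medium"),  # assumed default
        device=env.get("DEVICE", "cpu"),                        # assumed default
    )
```

Passing a plain dictionary instead of the real environment makes the loader easy to exercise in tests.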
Scale Horizontally
Deploy multiple API server instances behind a load balancer. See references/deployment.md for Nginx configuration.
Troubleshooting
Out of Memory
- Use smaller model (whisper-small or whisper-tiny)
- Reduce batch_size in pipeline
- Enable low_cpu_mem_usage=True
Slow Transcription
- Ensure GPU is enabled (DEVICE=cuda:0)
- Use FP16 precision (automatic on GPU)
- Increase batch_size
- Use smaller model
Connection Issues
- Verify WebSocket support in load balancer
- Check firewall rules
- Increase timeout settings
Scripts
- scripts/setup_api_server.py - Generate API server from template
- scripts/setup_plugin.py - Generate LiveKit plugin from template
References
Load these as needed for detailed information:
- references/api_server_guide.md - Complete API implementation guide
- references/plugin_implementation.md - LiveKit plugin development
- references/models_comparison.md - Model selection and optimization
- references/deployment.md - Production deployment best practices
Assets
Ready-to-use templates:
- assets/api-server/ - Complete FastAPI server with Whisper
- assets/plugin-template/ - LiveKit STT plugin structure
Best Practices
- Keep models in memory - Load once at startup, not per request
- Use appropriate model size - Balance quality vs. speed for your use case
- Process audio in chunks - 1-second chunks work well for streaming
- Implement proper cleanup - Close WebSocket connections gracefully
- Monitor metrics - Track latency, throughput, GPU utilization
- Use Docker - Ensures consistent deployments
- Enable authentication - Secure production APIs
- Scale horizontally - Use load balancer for high availability
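The first practice, loading the model once rather than per request, can be sketched with a cached accessor. The loader here is a stand-in that only records its calls; in a real server it would be replaced by the actual (expensive) model construction. The names get_model, _load_model, and LOAD_LOG are hypothetical.

```python
from functools import lru_cache

LOAD_LOG: list[str] = []  # records each real load, for demonstration only


def _load_model(model_id: str) -> dict:
    """Stand-in for an expensive model load (hypothetical placeholder)."""
    LOAD_LOG.append(model_id)
    return {"model_id": model_id}


@lru_cache(maxsize=1)
def get_model(model_id: str) -> dict:
    """Return the cached model, loading it only on the first call for this ID."""
    return _load_model(model_id)
```

Request handlers would call get_model(...) with the configured ID on every request and receive the same in-memory object each time. With maxsize=1, switching to a different model ID evicts the previous one.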