Agent skill
video-transcriber
Transcribe audio from videos using Whisper (local) or Gemini API (gemini-flash-lite-latest). Use when you need to convert video/audio to text for further processing, subtitle generation, or content analysis. Supports multiple languages, speaker diarization, and timestamp-accurate transcription. Gemini provides additional features like emotion detection and viral segment analysis.
Install this agent skill in your project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/video-transcriber
Metadata
Additional technical details for this skill
- models: whisper, gemini-flash-lite-latest
- version: 1.0
SKILL.md
Video Transcriber
This skill enables AI agents to transcribe audio from video files using either Whisper (local processing) or Gemini API (cloud processing with advanced features).
When to Use
- User wants to transcribe a video or audio file
- User needs subtitles/captions for a video
- User wants to analyze video content through transcription
- User needs to identify viral-worthy segments
- User wants speaker diarization or emotion detection
Model Selection
Whisper (Local)
Pros:
- Free to use
- Audio stays on the local machine (no cloud upload)
- Good for sensitive content
- Lower cost for high volume
Cons:
- Requires local processing power
- No built-in speaker diarization
- No emotion detection
- Limited to 99 languages
Models:
- tiny: Fastest, lower accuracy (~39M parameters)
- base: Fast, good accuracy (~74M parameters)
- small: Balanced speed/accuracy (~244M parameters)
- medium: Good accuracy, slower (~769M parameters)
- large-v3: Highest accuracy, slowest (~1550M parameters)
Gemini API (Cloud)
Pros:
- High accuracy with gemini-flash-lite-latest
- Built-in speaker diarization
- Emotion detection from speech
- Context understanding
- Can identify viral segments
- 125+ language support
- Faster processing (cloud-based)
Cons:
- Requires API key
- Cloud upload (privacy consideration)
- Cost per usage
- Internet required
Available Scripts
scripts/transcribe.py
Transcribe audio from video file.
Usage:
python skills/video-transcriber/scripts/transcribe.py <video_path> [options]
Options:
- --model, -m: Model to use (whisper, gemini) - default: auto
- --whisper-model: Whisper model size (tiny, base, small, medium, large-v3) - default: medium
- --use-faster: Use faster-whisper for speed - default: True
- --output, -o: Output file path (default: <video_path>.srt)
- --format: Output format (srt, vtt, json) - default: srt
- --language: Language code (e.g., en, id) - default: auto
- --speaker-diarization: Enable speaker labels (Gemini only)
- --emotion-detection: Enable emotion detection (Gemini only)
- --device: Device for Whisper (auto, cpu, cuda) - default: auto
Examples:
Transcribe with default settings (automatic model selection):
python skills/video-transcriber/scripts/transcribe.py video.mp4
Transcribe with Gemini API:
python skills/video-transcriber/scripts/transcribe.py video.mp4 --model gemini
Transcribe with speaker diarization and emotion detection (Gemini):
python skills/video-transcriber/scripts/transcribe.py video.mp4 --model gemini --speaker-diarization --emotion-detection
Transcribe with large Whisper model:
python skills/video-transcriber/scripts/transcribe.py video.mp4 --whisper-model large-v3
Output to JSON:
python skills/video-transcriber/scripts/transcribe.py video.mp4 --format json
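When an agent calls the script from Python instead of a shell, the command can be assembled as an argument list. The sketch below is one possible helper, not part of the skill: the function name build_transcribe_command and its parameters are hypothetical, and it only uses the flags documented above.

```python
# Hypothetical helper: assembles the documented transcribe.py command line.
SCRIPT_PATH = "skills/video-transcriber/scripts/transcribe.py"
VALID_MODELS = {"auto", "whisper", "gemini"}
VALID_FORMATS = {"srt", "vtt", "json"}


def build_transcribe_command(
    video_path,
    model="auto",
    output_format="srt",
    language=None,
    speaker_diarization=False,
    emotion_detection=False,
):
    """Return the argument list for transcribe.py using only documented flags."""
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model: {model}")
    if output_format not in VALID_FORMATS:
        raise ValueError(f"unknown format: {output_format}")
    # The options list marks both features as Gemini only.
    if model == "whisper" and (speaker_diarization or emotion_detection):
        raise ValueError("speaker diarization and emotion detection need Gemini")

    cmd = ["python", SCRIPT_PATH, video_path,
           "--model", model, "--format", output_format]
    if language is not None:
        cmd += ["--language", language]
    if speaker_diarization:
        cmd.append("--speaker-diarization")
    if emotion_detection:
        cmd.append("--emotion-detection")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Building a list rather than a single string avoids shell quoting problems with file names that contain spaces.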
scripts/analyze.py
Analyze audio content using Gemini API for viral segments, summary, or emotions.
Usage:
python skills/video-transcriber/scripts/analyze.py <video_path> [options]
Options:
- --analysis-type: Type of analysis (viral, summary, emotions, questions) - default: viral
- --num-segments: Number of segments to identify (for viral analysis) - default: 5
- --model: Model to use (default: gemini)
Examples:
Detect viral segments:
python skills/video-transcriber/scripts/analyze.py video.mp4 --analysis-type viral
Get summary:
python skills/video-transcriber/scripts/analyze.py video.mp4 --analysis-type summary
Analyze emotions:
python skills/video-transcriber/scripts/analyze.py video.mp4 --analysis-type emotions
Output Format
SRT Format
1
00:00:00,000 --> 00:00:05,000
This is the first subtitle.

2
00:00:05,500 --> 00:00:10,000
This is the second subtitle.
JSON Format
[
  {
    "index": 1,
    "start": 0.0,
    "end": 5.0,
    "text": "This is the first subtitle.",
    "speaker": "Speaker A",
    "emotion": "neutral"
  }
]
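The two formats carry the same timing information, so JSON output can be turned back into SRT when needed. The following is a minimal sketch, assuming the JSON fields shown above (index, start, end, text, speaker) with start and end given in seconds. The function names are illustrative and do not come from the skill's scripts.

```python
def format_srt_timestamp(seconds):
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments, include_speaker=False):
    """Render a list of segment dicts as SRT text.

    Cues are separated by a blank line, as the SRT format requires.
    """
    blocks = []
    for position, seg in enumerate(segments, start=1):
        # Fall back to the list position if a segment has no index.
        index = seg.get("index", position)
        text = seg["text"]
        if include_speaker and seg.get("speaker"):
            text = f"{seg['speaker']}: {text}"
        timing = (f"{format_srt_timestamp(seg['start'])} --> "
                  f"{format_srt_timestamp(seg['end'])}")
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

For a file produced with --format json, json.load gives the segment list that segments_to_srt expects.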
Auto Selection Logic
When --model is set to auto, the script selects a model based on:
- Privacy priority: Always use Whisper
- Quality needed: Use gemini for highest quality
- Content length: Use faster-whisper for long content (> 1 hour)
- Feature requirements: Use gemini if speaker diarization or emotion detection needed
- Default: Use gemini-flash-lite-latest
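The rules above can be read as a small decision function. The source does not say how the rules rank against each other, so the order below is an assumption: privacy first, then feature requirements (which only Gemini supports per the comparison earlier), then quality, then content length. The TranscriptionRequest type, its fields, and the returned labels are hypothetical names used for illustration.

```python
from dataclasses import dataclass

LONG_CONTENT_SECONDS = 3600  # "long content" threshold from the list: > 1 hour


@dataclass
class TranscriptionRequest:
    """Hypothetical description of what the caller needs."""
    privacy_required: bool = False
    highest_quality: bool = False
    duration_seconds: float = 0.0
    speaker_diarization: bool = False
    emotion_detection: bool = False


def select_model(req):
    """Pick a backend label; the rule precedence is an assumption."""
    if req.privacy_required:
        return "whisper"          # privacy priority: always local
    if req.speaker_diarization or req.emotion_detection:
        return "gemini"           # features listed as Gemini only
    if req.highest_quality:
        return "gemini"
    if req.duration_seconds > LONG_CONTENT_SECONDS:
        return "faster-whisper"   # long content
    return "gemini"               # default: gemini-flash-lite-latest
```

With this ordering, a two-hour recording that needs speaker labels goes to Gemini rather than faster-whisper, because a Whisper backend could not satisfy the request.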
Environment Variables
# For Gemini API
export GEMINI_API_KEY="your-api-key"
# Optional: For Vertex AI
export GOOGLE_PROJECT_ID="your-project-id"
export GOOGLE_LOCATION="us-central1"
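A caller may want to check these variables before starting a cloud transcription, so that a missing key fails early with a clear message. This is a sketch only: load_gemini_config is a hypothetical helper, and how the skill's scripts actually read their configuration is not shown in the source.

```python
import os


def load_gemini_config(env=None):
    """Read the Gemini settings named above from the environment.

    Raises RuntimeError if GEMINI_API_KEY is missing or empty. The two
    Vertex AI variables are optional and come back as None when unset.
    """
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set; export it before using Gemini")
    return {
        "api_key": api_key,
        "project_id": env.get("GOOGLE_PROJECT_ID"),
        "location": env.get("GOOGLE_LOCATION"),
    }
```

Passing a plain dict as env makes the helper easy to test without touching the real environment.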
Integration with Other Skills
After transcription, you can use these skills:
- highlight-scanner: Analyze transcript for viral moments
- subtitle-overlay: Add captions to video
- autocut-shorts: Full workflow for creating short clips
Common Workflow
- User provides video file or URL
- Download if needed (youtube-downloader)
- Transcribe using this skill
- Analyze transcript for highlights (highlight-scanner)
- Create short clips (autocut-shorts)
Tips
- Use --use-faster with Whisper for faster processing
- Use Gemini when you need speaker diarization
- Use --format json for programmatic processing
- For long videos, consider splitting into segments
- Use --analysis-type viral to identify best segments for short-form content
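For the tip about splitting long videos, the segment boundaries can be computed before any cutting is done. The sketch below is one possible approach and is not part of the skill: chunk_ranges, the 10-minute chunk length, and the 5-second overlap are all assumptions. The overlap is there so that a sentence spanning a boundary appears whole in at least one chunk.

```python
def chunk_ranges(duration_seconds, chunk_seconds=600.0, overlap_seconds=5.0):
    """Return (start, end) pairs in seconds that cover the whole duration.

    Consecutive chunks overlap by overlap_seconds; the last chunk is
    shortened to end exactly at duration_seconds.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be >= 0 and < chunk_seconds")

    ranges = []
    step = chunk_seconds - overlap_seconds
    start = 0.0
    while start < duration_seconds:
        end = min(start + chunk_seconds, duration_seconds)
        ranges.append((start, end))
        if end >= duration_seconds:
            break
        start += step
    return ranges
```

Each chunk can then be cut out and transcribed separately. Timestamps in a chunk's transcript need that chunk's start added back to line up with the original video, and text repeated in the overlap has to be deduplicated when the transcripts are merged.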
References
- Whisper documentation: https://github.com/openai/whisper
- Gemini API: https://ai.google.dev/gemini-api/docs/audio
- Language codes: ISO 639-1 codes (en, id, es, etc.)