Agent skill

media-understanding

Install this agent skill in your project:

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/marketing/media-understanding

SKILL.md

Media Understanding

Audio Files → faster-whisper (local)

For mp3, wav, m4a, flac, ogg, aac files:

bash
faster-whisper "path/to/audio.mp3" -o /tmp --model large-v3

Options

Option                  Description
-o DIR                  Output directory for .srt file
--model SIZE            tiny, base, small, medium, large-v3 (default: large-v3)
--language LANG         Force language (auto-detected by default)
--task transcribe       Transcribe in original language (default)
--task translate        Translate to English
--word_timestamps true  Include word-level timing

Output: an SRT subtitle file, written to the output directory.
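
When only the spoken text is needed, the SRT file can be reduced to plain lines. The following is a minimal sketch, assuming the standard SRT layout (a cue index line, a timestamp line, one or more text lines, then a blank line). The function name `srt_to_text` is hypothetical and not part of this skill.

```bash
# Hypothetical helper: print only the subtitle text from an SRT file.
# Assumes each cue is: index line, timestamp line, text lines, blank line.
srt_to_text() {
  awk '
    { sub(/\r$/, "") }                  # tolerate CRLF line endings
    $0 == ""   { state = 0; next }      # a blank line ends the current cue
    state == 0 { state = 1; next }      # first line of a cue: the index
    state == 1 { state = 2; next }      # second line: the timestamp range
    { print }                           # everything else is subtitle text
  ' "$1"
}
```

For example, `srt_to_text /tmp/audio.srt > /tmp/audio.txt`, where `/tmp/audio.srt` is an assumed file name; the exact name the tool writes is not stated here. Tracking position within each cue, rather than filtering numeric lines, keeps a subtitle line such as "42" from being mistaken for a cue index.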

Video Files → Gemini (visual + audio)

For mp4, mov, webm, avi, mkv files or YouTube URLs:

bash
uv run ~/.claude/skills/media-understanding/scripts/understand_video.py \
  --source "path/to/video.mp4" \
  --prompt "Describe what happens in this video"
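
The choice between the audio and video paths depends only on the file extension or URL, so it can be sketched as a small classifier. This is an illustration rather than part of the skill: the function name `media_kind` and the YouTube URL patterns are assumptions, while the extension lists are taken from this document.

```bash
# Hypothetical helper: classify a source as "audio", "video", or "unknown"
# using the extensions listed in this document.
media_kind() {
  local src="$1"
  # YouTube URLs go to the video path (these URL patterns are an assumption).
  case "$src" in
    http://*youtube.com/*|https://*youtube.com/*|http://youtu.be/*|https://youtu.be/*)
      echo "video"
      return 0
      ;;
  esac
  # Take the text after the last dot and lowercase it, so "CLIP.MP4" matches too.
  local ext="${src##*.}"
  ext="$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    mp3|wav|m4a|flac|ogg|aac) echo "audio" ;;
    mp4|mov|webm|avi|mkv)     echo "video" ;;
    *)                        echo "unknown" ;;
  esac
}
```

A caller could branch on the result, running `faster-whisper` when `media_kind "$file"` prints `audio` and the `understand_video.py` script when it prints `video`.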

Options

Option     Description
--fast     Use the faster Gemini Flash model
--fps N    Frame sampling rate (default: 1 fps)
--start N  Start time in seconds
--end N    End time in seconds
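
To analyze only part of a video, `--start` and `--end` can be added when a time window is given. The wrapper below is a rough sketch using only flags from the table above. The function name `understand_video`, its argument order, and the `DRY_RUN` switch are hypothetical conveniences, not features of the script.

```bash
# Hypothetical wrapper around understand_video.py.
# Usage: understand_video SOURCE PROMPT [START_SECONDS] [END_SECONDS]
# With DRY_RUN=1 it prints the command, one argument per line, instead of running it.
understand_video() {
  local src="$1" prompt="$2" start="${3:-}" end="${4:-}"
  # Rebuild the positional parameters as the argument list for the script.
  set -- --source "$src" --prompt "$prompt"
  if [ -n "$start" ]; then set -- "$@" --start "$start"; fi
  if [ -n "$end" ]; then set -- "$@" --end "$end"; fi
  set -- uv run "$HOME/.claude/skills/media-understanding/scripts/understand_video.py" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$@"
  else
    "$@"
  fi
}
```

For instance, `understand_video "path/to/video.mp4" "What text appears on screen?" 30 90` would limit the analysis to the span between 30 and 90 seconds. Passing arguments through `"$@"` keeps prompts with spaces or quotes intact.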

Example Prompts

  • "Summarize this video in 3 bullet points"
  • "Transcribe all spoken dialogue with timestamps"
  • "What text appears on screen?"
  • "Describe the main actions and events"

API Key

Gemini requires the GEMINI_API_KEY environment variable to be set.
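
A pre-flight check can surface a missing key before any video is processed. This is a minimal sketch; the function name `require_gemini_key` and the error message are illustrative and not part of the skill.

```bash
# Hypothetical pre-flight check: fail early with a clear message
# when GEMINI_API_KEY is unset or empty.
require_gemini_key() {
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY is not set; export it before running the video script." >&2
    return 1
  fi
}
```

Calling `require_gemini_key || exit 1` at the top of a wrapper script stops the run before the video script is invoked.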
