Agent skill

transcription

Audio/video transcription using OpenAI Whisper. Covers installation, model selection, transcript formats (SRT, VTT, JSON), timing synchronization, and speaker diarization. Use when transcribing media or generating subtitles.

Install this agent skill to your Project

npx add-skill https://github.com/MadAppGang/claude-code/tree/main/plugins/video-editing/skills/transcription

SKILL.md

plugin: video-editing updated: 2026-01-20

Transcription with Whisper

Production-ready patterns for audio/video transcription using OpenAI Whisper.

System Requirements

Installation Options

Option 1: OpenAI Whisper (Python)

bash

# macOS/Linux/Windows
pip install openai-whisper

# Verify
whisper --help

Option 2: whisper.cpp (C++ - faster)

bash

# macOS
brew install whisper-cpp

# Linux - build from source
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make

# Windows - use pre-built binaries or build with cmake

Option 3: Insanely Fast Whisper (GPU accelerated)

bash

pip install insanely-fast-whisper

Model Selection

Model	Parameters	VRAM (approx.)	Accuracy	Speed	Use Case
tiny	39M	~1GB	Low	Fastest	Quick previews
base	74M	~1GB	Medium	Fast	Draft transcripts
small	244M	~2GB	Good	Medium	General use
medium	769M	~5GB	Better	Slow	Quality transcripts
large-v3	1550M	~10GB	Best	Slowest	Final production

Recommendation: Start with small for a balance of speed and quality. Switch to large-v3 for final delivery.
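The table above can also be applied programmatically. The following is a minimal sketch, assuming the approximate VRAM figures from the table. The name `pick_model` and the `MODEL_VRAM_GB` list are hypothetical helpers for illustration, not part of Whisper.

```python
# Approximate VRAM needs in GB, copied from the model table, smallest first.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_model(available_vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits (hypothetical helper)."""
    chosen = "tiny"  # fall back to the smallest model if nothing fits
    for name, needed_gb in MODEL_VRAM_GB:
        if needed_gb <= available_vram_gb:
            chosen = name
    return chosen


print(pick_model(4))   # small
print(pick_model(12))  # large-v3
```

The figures are rough, so leave some headroom on a GPU that is shared with other work.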

Basic Transcription

Using OpenAI Whisper

bash

# Basic transcription (auto-detect language)
whisper audio.mp3 --model small

# Specify language and output format
whisper audio.mp3 --model medium --language en --output_format srt

# Multiple output formats
whisper audio.mp3 --model small --output_format all

# With timestamps and word-level timing
whisper audio.mp3 --model small --word_timestamps True

Using whisper.cpp

bash

# Download model first
./models/download-ggml-model.sh base.en

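# Transcribe to SRT (recent whisper.cpp builds name this binary whisper-cli instead of main)
./main -m models/ggml-base.en.bin -f audio.wav -osrt

# CSV output (one row per segment with start and end times)
./main -m models/ggml-base.en.bin -f audio.wav -ocsv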

Output Formats

SRT (SubRip Subtitle)

1
00:00:01,000 --> 00:00:04,500
Hello and welcome to this video.

2
00:00:05,000 --> 00:00:08,200
Today we'll discuss video editing.

VTT (WebVTT)

WEBVTT

00:00:01.000 --> 00:00:04.500
Hello and welcome to this video.

00:00:05.000 --> 00:00:08.200
Today we'll discuss video editing.

JSON (with word-level timing)

json

{
  "text": "Hello and welcome to this video.",
  "segments": [
    {
      "id": 0,
      "start": 1.0,
      "end": 4.5,
      "text": " Hello and welcome to this video.",
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3},
        {"word": "and", "start": 1.4, "end": 1.5},
        {"word": "welcome", "start": 1.6, "end": 2.0},
        {"word": "to", "start": 2.1, "end": 2.2},
        {"word": "this", "start": 2.3, "end": 2.5},
        {"word": "video", "start": 2.6, "end": 3.0}
      ]
    }
  ]
}
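The three formats carry the same segment timing, so SRT and VTT can be derived from the JSON segments. Here is a rough sketch of that mapping. The function names are hypothetical, and it relies only on the `start`, `end`, and `text` fields shown above.

```python
def format_timestamp(seconds: float, decimal_sep: str = ",") -> str:
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_sep}{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text: numbered cues with comma decimal separators."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def segments_to_vtt(segments) -> str:
    """Build WebVTT text: WEBVTT header and dot decimal separators."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"], ".")
        end = format_timestamp(seg["end"], ".")
        lines += [f"{start} --> {end}", seg["text"].strip(), ""]
    return "\n".join(lines)


segments = [
    {"start": 1.0, "end": 4.5, "text": " Hello and welcome to this video."},
    {"start": 5.0, "end": 8.2, "text": " Today we'll discuss video editing."},
]
print(segments_to_srt(segments))
print(segments_to_vtt(segments))
```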

Audio Extraction for Transcription

Before transcribing video, extract the audio in a format that suits Whisper:

bash

# Extract audio as WAV (16 kHz, mono - optimal for Whisper); -vn drops the video stream
ffmpeg -i video.mp4 -vn -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Extract as high-quality WAV for archival
ffmpeg -i video.mp4 -vn -c:a pcm_s16le audio.wav

# Extract as compressed MP3 (smaller, still works)
ffmpeg -i video.mp4 -vn -c:a libmp3lame -q:a 2 audio.mp3

Timing Synchronization

Convert Whisper JSON to FCP Timing

python

import json

def whisper_to_fcp_timing(whisper_json_path, fps=24):
    """Convert Whisper JSON segments to second- and frame-based timing for FCP."""
    with open(whisper_json_path, encoding="utf-8") as f:
        data = json.load(f)

    segments = []
    for seg in data.get("segments", []):
        segments.append({
            "start_time": seg["start"],
            "end_time": seg["end"],
            # Round to the nearest frame; int() truncates and can land a frame early
            "start_frame": round(seg["start"] * fps),
            "end_frame": round(seg["end"] * fps),
            "text": seg["text"].strip(),
            "words": seg.get("words", [])
        })

    return segments
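Two things affect frame numbers here. The first is float error: `int(0.29 * 100)` evaluates to 28, not 29. The second is that NTSC-style rates such as 23.976 fps are really 24000/1001. The sketch below does the conversion with exact rational arithmetic. The helper names and the `NTSC_24` constant are hypothetical, introduced only for illustration.

```python
from fractions import Fraction

NTSC_24 = Fraction(24000, 1001)  # the exact value behind "23.976 fps"


def seconds_to_frame(seconds: float, fps: Fraction) -> int:
    """Nearest frame index for a timestamp at an exact rational frame rate."""
    exact = Fraction(str(seconds)) * fps  # str() keeps the printed decimal value
    return round(exact)


def frame_to_seconds(frame: int, fps: Fraction) -> float:
    """Start time of a frame, in seconds."""
    return float(Fraction(frame) / fps)


# Over 100 seconds, treating 23.976 fps footage as 24 fps drifts by two frames.
print(seconds_to_frame(100.0, Fraction(24)))  # 2400
print(seconds_to_frame(100.0, NTSC_24))       # 2398
```

Passing the rate as a `Fraction` keeps long timelines from drifting, at the cost of a slightly less convenient call signature.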

Frame-Accurate Timing

bash

# Get exact frame count and duration
ffprobe -v error -count_frames -select_streams v:0 \
  -show_entries stream=nb_read_frames,duration,r_frame_rate \
  -of json video.mp4
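The JSON printed by that ffprobe command can be parsed into an exact frame rate. This is a sketch under assumptions: the sample string is made up for illustration, and real output may hold extra keys or lack `duration` for some containers. ffprobe reports these values as strings.

```python
import json
from fractions import Fraction


def parse_video_timing(ffprobe_json: str) -> dict:
    """Extract frame rate, frame count and duration from ffprobe JSON (hypothetical helper)."""
    stream = json.loads(ffprobe_json)["streams"][0]
    return {
        "fps": Fraction(stream["r_frame_rate"]),   # e.g. "24000/1001"
        "frames": int(stream["nb_read_frames"]),   # counted by -count_frames
        "duration": float(stream["duration"]),     # seconds
    }


# Illustrative sample only, not captured from a real file.
sample = """
{
  "streams": [
    {"r_frame_rate": "24000/1001", "duration": "10.010000", "nb_read_frames": "240"}
  ]
}
"""

info = parse_video_timing(sample)
print(info["fps"], info["frames"], info["duration"])
# Sanity check: the frame count divided by the rate should match the duration.
print(float(info["frames"] / info["fps"]))
```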

Speaker Diarization

For multi-speaker content, use pyannote.audio:

bash

pip install pyannote.audio

python

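import os
from pyannote.audio import Pipeline

# Gated model: accept its terms on Hugging Face, then pass an access token.
# HF_TOKEN is an assumed variable name; use_auth_token is the 2.x/3.x keyword,
# so check the docs for your installed version.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1", use_auth_token=os.environ["HF_TOKEN"]
)
diarization = pipeline("audio.wav")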

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
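Diarization yields speaker turns and Whisper yields text segments, so the two still need to be joined. One simple approach, sketched here as an assumption and not as anything pyannote or Whisper provides, is to label each segment with the speaker whose turns overlap it the most. The `turns` list is assumed to be `(start, end, speaker)` tuples collected from the loop above.

```python
def overlap(a_start, a_end, b_start, b_end) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker whose turns overlap it most (hypothetical helper)."""
    labeled = []
    for seg in segments:
        totals = {}
        for start, end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], start, end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        best = max(totals, key=totals.get) if totals else unknown
        labeled.append({**seg, "speaker": best})
    return labeled


segments = [
    {"start": 1.0, "end": 4.5, "text": "Hello and welcome to this video."},
    {"start": 5.0, "end": 8.2, "text": "Today we'll discuss video editing."},
]
turns = [(0.0, 4.8, "SPEAKER_00"), (4.8, 9.0, "SPEAKER_01")]

for seg in assign_speakers(segments, turns):
    print(f"{seg['speaker']}: {seg['text']}")
```

Segments that span a speaker change get a single label under this scheme. Word-level timestamps would allow a finer split.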

Batch Processing

bash

#!/bin/bash
# Transcribe all videos in directory

MODEL="small"
OUTPUT_DIR="transcripts"
mkdir -p "$OUTPUT_DIR"

for video in *.mp4 *.mov *.avi; do
  [[ -f "$video" ]] || continue

  base="${video%.*}"

  # Extract audio (skip this file if extraction fails)
  ffmpeg -y -i "$video" -vn -ar 16000 -ac 1 -c:a pcm_s16le "/tmp/${base}.wav" || continue

  # Transcribe
  whisper "/tmp/${base}.wav" --model "$MODEL" \
    --output_format all \
    --output_dir "$OUTPUT_DIR"

  # Cleanup temp audio
  rm "/tmp/${base}.wav"

  echo "Transcribed: $video"
done

Quality Optimization

Improve Accuracy

Noise reduction before transcription:

bash

ffmpeg -i noisy_audio.wav -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean_audio.wav

Use language hint:

bash

whisper audio.mp3 --language en --model medium

Provide initial prompt for context:

bash

whisper audio.mp3 --initial_prompt "Technical discussion about video editing software."

Performance Tips

GPU acceleration (if available):

bash

whisper audio.mp3 --model large-v3 --device cuda

Process in chunks for long videos:

python

# Split audio into 10-minute chunks
# Transcribe each chunk
# Merge results with time offset adjustment
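The merge step is where mistakes tend to happen, so here is a minimal sketch of it. It assumes every chunk is cut at exactly the same fixed length and that each chunk's result is a Whisper-style dict with `text` and `segments`. The name `merge_chunks` is hypothetical.

```python
def merge_chunks(chunk_results, chunk_seconds=600.0):
    """Merge per-chunk results, shifting timestamps by each chunk's start offset."""
    merged = {"text": "", "segments": []}
    for index, result in enumerate(chunk_results):
        offset = index * chunk_seconds  # assumes fixed-length chunks in playback order
        for seg in result.get("segments", []):
            shifted = dict(seg)
            shifted["id"] = len(merged["segments"])  # renumber across chunks
            shifted["start"] = seg["start"] + offset
            shifted["end"] = seg["end"] + offset
            if "words" in seg:
                shifted["words"] = [
                    {**w, "start": w["start"] + offset, "end": w["end"] + offset}
                    for w in seg["words"]
                ]
            merged["segments"].append(shifted)
        merged["text"] += result.get("text", "")
    return merged


chunk = {
    "text": " Hello and welcome to this video.",
    "segments": [{"id": 0, "start": 1.0, "end": 4.5,
                  "text": " Hello and welcome to this video."}],
}
merged = merge_chunks([chunk, chunk])
print([(s["id"], s["start"], s["end"]) for s in merged["segments"]])
```

Cutting at a fixed interval can split a word across two chunks. Cutting at silences, or overlapping chunks slightly, avoids this but needs a more careful merge than the one shown.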

Error Handling

bash

# Validate audio file before transcription
validate_audio() {
  local file="$1"
  if ffprobe -v error -select_streams a:0 -show_entries stream=codec_type -of csv=p=0 "$file" 2>/dev/null | grep -q "audio"; then
    return 0
  else
    echo "Error: No audio stream found in $file" >&2
    return 1
  fi
}

# Check Whisper installation
check_whisper() {
  if command -v whisper &> /dev/null; then
    echo "Whisper available"
    return 0
  else
    echo "Error: Whisper not installed. Run: pip install openai-whisper" >&2
    return 1
  fi
}

Related Skills

ffmpeg-core - Audio extraction and preprocessing
final-cut-pro - Import transcripts as titles/markers

Maintainer

MadAppGang Core maintainer

Source details

Full Name: MadAppGang/claude-code
Branch: main
Path in repo: plugins/video-editing/skills/transcription
License: MIT License
