asr-transcribe-to-text
Transcribe audio and video files to text using a remote ASR service (Qwen3-ASR or OpenAI-compatible endpoint). Extracts audio from video, sends to configurable ASR endpoint, outputs clean text. Use when the user wants to transcribe recordings, convert audio/video to text, do speech-to-text, or mentions ASR, Qwen ASR, 转录, 语音转文字, 录音转文字, or has a meeting recording, lecture, interview, or screen recording to transcribe.
Install this agent skill to your Project
npx add-skill https://github.com/daymade/claude-code-skills/tree/main/asr-transcribe-to-text
ASR Transcribe to Text
Transcribe audio/video files to text using a configurable ASR endpoint (default: Qwen3-ASR-1.7B via vLLM). Configuration persists across sessions in ${CLAUDE_PLUGIN_DATA}/config.json.
Step 0: Load or Initialize Configuration
cat "${CLAUDE_PLUGIN_DATA}/config.json" 2>/dev/null
If config exists, read the values and proceed to Step 1.
If config does not exist (first run), use AskUserQuestion:
First-time setup for ASR transcription.
I need to know where your ASR service is running so I can send audio to it.
RECOMMENDATION: Use the defaults below if you have Qwen3-ASR on a 4090 via Tailscale.
Q1: ASR Endpoint URL?
A) http://workstation-4090-wsl:8002/v1/audio/transcriptions (Default — Qwen3-ASR vLLM via Tailscale)
B) http://localhost:8002/v1/audio/transcriptions (Local machine)
C) Let me enter a custom URL
Q2: Does your network have an HTTP proxy that might intercept LAN/Tailscale traffic?
A) Yes — add --noproxy to bypass it (Recommended if you use Shadowrocket/Clash/corporate proxy)
B) No — direct connection is fine
Save the config:
mkdir -p "${CLAUDE_PLUGIN_DATA}"
python3 -c "
import json
config = {
    'endpoint': 'USER_PROVIDED_ENDPOINT',
    'model': 'USER_PROVIDED_MODEL_OR_DEFAULT',
    'noproxy': True,  # or False based on user answer
    'max_timeout': 900
}
with open('${CLAUDE_PLUGIN_DATA}/config.json', 'w') as f:
    json.dump(config, f, indent=2)
print('Config saved.')
"
Step 1: Validate Input and Check Service Health
Read config and health-check in a single command (shell variables don't persist across Bash calls):
python3 -c "
import json, subprocess, sys
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
    cfg = json.load(f)
base = cfg['endpoint'].rsplit('/audio/', 1)[0]
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
result = subprocess.run(
    ['curl', '-s', '--max-time', '10'] + noproxy + [f'{base}/models'],
    capture_output=True, text=True
)
if result.returncode != 0 or not result.stdout.strip():
    print('HEALTH CHECK FAILED', file=sys.stderr)
    print(f'Endpoint: {base}/models', file=sys.stderr)
    print(f'stdout: {result.stdout[:200]}', file=sys.stderr)
    print(f'stderr: {result.stderr[:200]}', file=sys.stderr)
    sys.exit(1)
else:
    print(f'Service healthy: {base}')
    print(f'Model: {cfg[\"model\"]}')
"
If health check fails, use AskUserQuestion:
ASR service at [endpoint] is not responding.
Options:
A) Diagnose — check network, Tailscale, and service status step by step
B) Reconfigure — the endpoint URL might be wrong, let me re-enter it
C) Try anyway — send the transcription request and see what happens
D) Abort — I'll fix the service manually and come back later
For option A, diagnose in order:
- Network: ping -c 1 HOST or tailscale status | grep HOST
- Service: tailscale ssh USER@HOST "curl -s localhost:PORT/v1/models"
- Proxy: retry with --noproxy '*' toggled
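The HOST and PORT placeholders in the diagnosis steps can be read off the configured endpoint rather than typed by hand. The following is a minimal sketch, not part of the skill itself: the function names are hypothetical, the port fallback (80 or 443 when the URL has none) is an assumption, and the /models URL is derived with the same rsplit('/audio/', 1) rule as the health check in Step 1.

```python
from urllib.parse import urlsplit

def models_url(endpoint):
    """Derive the /models URL the same way the Step 1 health check does."""
    base = endpoint.rsplit("/audio/", 1)[0]
    return f"{base}/models"

def host_and_port(endpoint):
    """Extract host and port; fall back to the scheme's usual port (assumption)."""
    parts = urlsplit(endpoint)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port

def diagnostic_commands(endpoint, ssh_user):
    """Build the network and service checks as argument lists (hypothetical helper)."""
    host, port = host_and_port(endpoint)
    return [
        ["ping", "-c", "1", host],
        ["tailscale", "ssh", f"{ssh_user}@{host}", f"curl -s localhost:{port}/v1/models"],
    ]

default_endpoint = "http://workstation-4090-wsl:8002/v1/audio/transcriptions"
print(models_url(default_endpoint))
print(host_and_port(default_endpoint))
```

Note that an endpoint without an /audio/ segment passes through rsplit unchanged, so /models is appended to the full URL. A custom endpoint with a different path layout may therefore need its own health-check URL.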
Step 2: Extract Audio (if input is video)
For video files (mp4, mov, mkv, avi, webm), extract audio as 16kHz mono MP3:
ffmpeg -i INPUT_VIDEO -vn -acodec libmp3lame -q:a 4 -ar 16000 -ac 1 OUTPUT.mp3 -y
For audio files (mp3, wav, m4a, flac, ogg), use directly — no conversion needed.
Get duration for progress estimation:
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 INPUT_FILE
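The extension rules above can be turned into a small decision helper. This is a sketch under stated assumptions: plan_input and parse_duration are hypothetical names, and writing the extracted MP3 next to the source file is an illustrative choice (the command above leaves OUTPUT.mp3 up to the caller). The ffmpeg arguments are copied from the command shown in this step.

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".avi", ".webm"}
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}

def plan_input(path):
    """Return (audio_path, ffmpeg_command); the command is None for audio input."""
    src = Path(path)
    ext = src.suffix.lower()
    if ext in AUDIO_EXTS:
        return str(src), None  # audio is sent as-is
    if ext in VIDEO_EXTS:
        out = src.with_suffix(".mp3")  # assumption: extract next to the source
        cmd = ["ffmpeg", "-i", str(src), "-vn", "-acodec", "libmp3lame",
               "-q:a", "4", "-ar", "16000", "-ac", "1", str(out), "-y"]
        return str(out), cmd
    raise ValueError(f"unsupported file type: {ext or '(none)'}")

def parse_duration(ffprobe_output):
    """Parse the single number printed by the ffprobe command above."""
    return float(ffprobe_output.strip())

audio_path, command = plan_input("talk.MP4")
print(audio_path, command is not None)
```

Passing the command as a list to subprocess.run avoids shell quoting problems with file names that contain spaces.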
Step 3: Transcribe — Single Request First
Always try a full-length single request first. Chunking truncates sentences at every split boundary: the model forces the last sentence of each chunk to close and drops words. A single request avoids truncation entirely and is also the fastest path.
The Qwen3-ASR paper's "20-minute limit" is a training benchmark, not an inference hard limit. Empirically verified: 55 minutes transcribed in a single 76-second request on 4090 24GB.
python3 -c "
import json, subprocess, sys, os, tempfile
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
    cfg = json.load(f)
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
timeout = str(cfg.get('max_timeout', 900))
audio_file = 'AUDIO_FILE_PATH'  # replace with actual path
# mkstemp creates the file atomically; mktemp is deprecated and race-prone
fd, output_json = tempfile.mkstemp(suffix='.json', prefix='asr_')
os.close(fd)
try:
    # -sS keeps curl quiet but still reports errors on stderr
    result = subprocess.run(
        ['curl', '-sS', '--max-time', timeout] + noproxy + [
            cfg['endpoint'],
            '-F', f'file=@{audio_file}',
            '-F', f'model={cfg[\"model\"]}',
            '-o', output_json
        ], capture_output=True, text=True
    )
    # curl exits non-zero on timeout or connection failure
    if result.returncode != 0:
        print(f'ERROR: curl exited {result.returncode}: {result.stderr[:200]}', file=sys.stderr)
        sys.exit(1)
    with open(output_json) as f:
        data = json.load(f)
finally:
    # remove the temp file on success and on every error path
    os.unlink(output_json)
if 'text' not in data:
    print(f'ERROR: {json.dumps(data)[:300]}', file=sys.stderr)
    sys.exit(1)
text = data['text']
duration = data.get('usage', {}).get('seconds', 0)
print(f'Transcribed: {len(text)} chars, {duration}s audio', file=sys.stderr)
print(text)
" > OUTPUT.txt
Performance reference: ~400 characters per minute for Chinese speech; rates vary by language. Qwen3-ASR supports 52 languages including Chinese dialects, English, Japanese, Korean, and more.
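The one timing data point in this step (55 minutes of audio in a 76-second request) can be expressed as a real-time factor and compared against max_timeout. This is only a rough sketch: the helper names are hypothetical, and it assumes processing time scales linearly with audio length on comparable hardware, which this document does not verify.

```python
def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / wall_seconds

def estimated_wall_seconds(audio_seconds, factor):
    """Linear extrapolation (assumption): same speed for any audio length."""
    return audio_seconds / factor

# Reference measurement quoted in Step 3: 55 minutes in 76 seconds.
reference = realtime_factor(55 * 60, 76)

# Rough estimate for a 2-hour recording, compared with the default max_timeout.
two_hours = estimated_wall_seconds(2 * 3600, reference)
print(round(reference, 1), round(two_hours, 1), two_hours < 900)
```

Treat the estimate as a sanity check for choosing max_timeout, not as a guarantee. Memory limits may end a long request well before the timeout does.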
Step 4: Verify and Confirm Output
After transcription, verify quality:
- Confirm the response contains a text field (not an error message)
- Check character count is plausible for the audio duration (~400 chars/min for Chinese)
- Show the user the first ~200 characters as a preview
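The three checks above can be sketched as a small verification helper. The 400 chars/min rate comes from this document and applies to Chinese speech. The low and high ratio bounds are illustrative assumptions, not values defined by the skill, and the function names are hypothetical.

```python
CHARS_PER_MINUTE = 400  # reference rate for Chinese speech; other languages differ

def expected_chars(duration_seconds, chars_per_minute=CHARS_PER_MINUTE):
    """Rough character count expected for the given audio duration."""
    return round(duration_seconds / 60 * chars_per_minute)

def looks_plausible(text, duration_seconds, low=0.3, high=3.0):
    """Flag empty output or a length far from the expected count.

    low/high are illustrative thresholds (assumption), not part of the skill.
    """
    if not text.strip():
        return False
    expected = expected_chars(duration_seconds)
    if expected == 0:
        return True  # no duration available, nothing to compare against
    return low <= len(text) / expected <= high

def preview(text, limit=200):
    """First ~200 characters to show the user."""
    return text[:limit]

print(expected_chars(600))
```

A wide tolerance band is deliberate here: recordings with long silences or fast speakers can legitimately fall far from the reference rate.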
If the output looks wrong (empty, garbled, or error), use AskUserQuestion:
Transcription may have an issue:
- Expected: ~[N] chars for [M] minutes of audio
- Got: [actual chars] chars
- Preview: "[first 100 chars...]"
Options:
A) Save as-is — the output looks fine to me
B) Retry with fallback — split into chunks and merge (handles long audio / OOM)
C) Reconfigure — try a different model or endpoint
D) Abort — something is wrong with the service
If output is good, save as .txt alongside the original file or to user-specified location.
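One way to pick the default save location is sketched below. default_output_path is a hypothetical helper. It keeps the original file name, swaps the extension for .txt, and uses the source directory unless the user names another one.

```python
from pathlib import Path

def default_output_path(input_path, output_dir=None):
    """Place NAME.txt next to the input, or in output_dir when one is given."""
    src = Path(input_path)
    target_dir = Path(output_dir) if output_dir else src.parent
    return target_dir / (src.stem + ".txt")

print(default_output_path("recordings/meeting.mp4"))
```

The sketch does not check for an existing file at the target path, so a caller may want to confirm before overwriting.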
Step 5: Fallback — Overlap-Merge for Very Long Audio
If the single request fails (timeout, OOM, HTTP error), fall back to chunked transcription with overlap merging:
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/overlap_merge_transcribe.py \
--config "${CLAUDE_PLUGIN_DATA}/config.json" \
INPUT_AUDIO OUTPUT.txt
This splits into 18-minute chunks with 2-minute overlap, then merges using punctuation-stripped fuzzy matching. See references/overlap_merge_strategy.md for the algorithm details.
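The split-and-merge idea can be illustrated with a short sketch. It is not the code in overlap_merge_transcribe.py, whose details are in the referenced strategy document. The sketch assumes each chunk starts 16 minutes after the previous one (18 minutes minus the 2-minute overlap) and uses a longest-common-run search from difflib as a simplified stand-in for the fuzzy matching. The window and min_match values are illustrative.

```python
from difflib import SequenceMatcher

CHUNK_SECONDS = 18 * 60
OVERLAP_SECONDS = 2 * 60

def plan_chunks(duration, chunk=CHUNK_SECONDS, overlap=OVERLAP_SECONDS):
    """Return (start, end) windows; each repeats the last `overlap` seconds of the previous one."""
    step = chunk - overlap
    windows, start = [], 0.0
    while True:
        end = min(start + chunk, duration)
        windows.append((start, end))
        if end >= duration:
            return windows
        start += step

def _alnum_index(text):
    """Return text without punctuation or spaces, plus each kept character's original index."""
    positions = [i for i, ch in enumerate(text) if ch.isalnum()]
    return "".join(text[i] for i in positions), positions

def merge_overlap(prev, nxt, window=1000, min_match=8):
    """Join two transcripts at the longest run they share, ignoring punctuation."""
    prev_chars, prev_pos = _alnum_index(prev)
    next_chars, next_pos = _alnum_index(nxt)
    tail_offset = max(0, len(prev_chars) - window)
    tail, head = prev_chars[tail_offset:], next_chars[:window]
    match = SequenceMatcher(None, tail, head, autojunk=False).find_longest_match(
        0, len(tail), 0, len(head))
    if match.size < min_match:
        return prev + nxt  # no reliable overlap found: plain concatenation
    # Cut both texts right after the shared run, mapped back to original indices.
    cut_prev = prev_pos[tail_offset + match.a + match.size - 1] + 1
    cut_next = next_pos[match.b + match.size - 1] + 1
    return prev[:cut_prev] + nxt[cut_next:]

print(plan_chunks(55 * 60))
print(merge_overlap("alpha beta gamma delta eps.", "beta gamma delta epsilon zeta"))
```

Cutting the earlier chunk at the end of the shared run discards its final, possibly truncated sentence and keeps the later chunk's complete version of the same words.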
Reconfigure
To change the ASR endpoint, model, or proxy settings:
rm "${CLAUDE_PLUGIN_DATA}/config.json"
Then re-run Step 0 to collect new values via AskUserQuestion.
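The load-or-initialize flow from Step 0 and the reset described here can be sketched together. This is a minimal illustration, not the skill's own code: load_config and reset_config are hypothetical names, the defaults mirror the values saved in Step 0, and demo-model is a placeholder rather than a real model name.

```python
import json
import os
import tempfile

# Defaults mirror the values written in Step 0.
DEFAULTS = {"noproxy": True, "max_timeout": 900}
REQUIRED = ("endpoint", "model")

def load_config(path):
    """Return the saved config merged over DEFAULTS, or None on first run."""
    if not os.path.exists(path):
        return None  # caller should run the first-time setup questions
    with open(path) as f:
        cfg = {**DEFAULTS, **json.load(f)}
    missing = [key for key in REQUIRED if not cfg.get(key)]
    if missing:
        raise ValueError("config is missing: " + ", ".join(missing))
    return cfg

def reset_config(path):
    """Equivalent of the rm command above; a missing file is not an error."""
    if os.path.exists(path):
        os.remove(path)

# Demo in a throwaway directory (stands in for CLAUDE_PLUGIN_DATA).
demo_path = os.path.join(tempfile.mkdtemp(), "config.json")
first_run = load_config(demo_path)
with open(demo_path, "w") as f:
    json.dump({"endpoint": "http://localhost:8002/v1/audio/transcriptions",
               "model": "demo-model"}, f)  # placeholder model name
loaded = load_config(demo_path)
reset_config(demo_path)
after_reset = load_config(demo_path)
print(first_run, loaded["max_timeout"], after_reset)
```

Returning None for a missing file keeps the first-run case distinct from a broken config, which raises instead.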