Agent skill

multimodal-llm

Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling, Sora, Veo, Runway), or building multimodal AI pipelines.

Stars 143
Forks 15

Install this agent skill in your project

npx add-skill https://github.com/yonatangross/orchestkit/tree/main/src/skills/multimodal-llm

Metadata

Additional technical details for this skill

category
mcp-enhancement

SKILL.md

Multimodal LLM Patterns

Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling 3.0, Sora 2, Veo 3.1, Runway Gen-4.5).

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
| Video: Model Selection | 1 | HIGH | Choosing video gen provider (Kling, Sora, Veo, Runway) |
| Video: API Patterns | 1 | HIGH | Async task polling, SDK integration, webhook callbacks |
| Video: Multi-Shot | 1 | HIGH | Storyboarding, character elements, scene consistency |

Total: 9 rules across 3 categories (Vision, Audio, Video Generation)

Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.

| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |
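
The "resize before encoding" advice can be sketched as below. This is a minimal illustration, not the skill's own implementation: `fit_within` and `encode_resized_png` are hypothetical helper names, the 2048px cap is taken from the Common Mistakes list in this document, and Pillow is assumed to be installed for the actual resize step.

```python
import base64

MAX_SIDE = 2048  # longest edge in pixels; assumption based on the ">2048px" guidance


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Return (width, height) scaled down so the longest edge is at most max_side.

    The aspect ratio is preserved and images that already fit are left unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def encode_resized_png(path: str, max_side: int = MAX_SIDE) -> str:
    """Downscale an image file if needed and return it as a base64 PNG string."""
    # Pillow is imported lazily so fit_within stays usable without it.
    import io
    from PIL import Image

    with Image.open(path) as img:
        target = fit_within(img.width, img.height, max_side)
        if target != img.size:
            img = img.resize(target, Image.LANCZOS)
        buf = io.BytesIO()
        img.save(buf, format="PNG")
    return base64.standard_b64encode(buf.getvalue()).decode("utf-8")
```

The returned string can be used as the `data` field of a base64 image block with `media_type` set to `image/png`.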

Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |
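
As a rough sketch of document input, the helper below builds a user message that carries a PDF as a base64 document block followed by a text prompt. It assumes the Anthropic Messages API content-block shape; other providers use different payloads, so check the provider documentation before relying on it. `build_pdf_message` is a hypothetical name used only for illustration.

```python
import base64


def build_pdf_message(pdf_bytes: bytes, prompt: str) -> dict:
    """Build a user message with one base64 PDF document block and a text prompt."""
    # Cheap sanity check: PDF files start with the "%PDF" magic bytes.
    if not pdf_bytes.startswith(b"%PDF"):
        raise ValueError("input does not look like a PDF")
    data = base64.standard_b64encode(pdf_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            # Assumed block shape for PDF input on the Anthropic Messages API.
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": data},
            },
            {"type": "text", "text": prompt},
        ],
    }
```

The result would be passed as a single entry in the `messages` list of `client.messages.create(...)`, together with `model` and `max_tokens`.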

Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |

Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |

Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |

Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |

Video: Model Selection

Choose the right video generation provider based on use case, duration, and budget.

| Rule | File | Key Pattern |
|---|---|---|
| Video Models | rules/video-generation-models.md | Kling vs Sora vs Veo vs Runway, pricing, capabilities |

Video: API Patterns

Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.

| Rule | File | Key Pattern |
|---|---|---|
| API Integration | rules/video-generation-patterns.md | Kling REST, fal.ai SDK, Vercel AI SDK, task polling |
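
Async task polling can be sketched generically as follows. This is a provider-agnostic illustration under stated assumptions: `fetch_status` is a hypothetical callable that wraps whatever "get task" endpoint your provider exposes, and the status strings (`succeeded`, `failed`, `canceled`) are placeholders that will differ between Kling, fal.ai, and other APIs.

```python
import time
from typing import Callable

# Placeholder status names; map these to your provider's actual values.
TERMINAL_OK = {"succeeded"}
TERMINAL_FAILED = {"failed", "canceled"}


def poll_task(
    fetch_status: Callable[[], dict],
    *,
    timeout_s: float = 600.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reaches a terminal status, doubling the delay each round."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        task = fetch_status()
        status = task.get("status")
        if status in TERMINAL_OK:
            return task
        if status in TERMINAL_FAILED:
            raise RuntimeError(f"video task ended with status {status!r}")
        # Give up rather than sleeping past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError("video task did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)
```

Injecting `sleep` keeps the helper easy to test. Where the provider supports webhook callbacks, those avoid polling altogether and are usually the better fit for long-running jobs.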

Video: Multi-Shot

Generate multi-scene videos with consistent characters using storyboarding and character elements.

| Rule | File | Key Pattern |
|---|---|---|
| Multi-Shot | rules/video-multi-shot.md | Kling 3.0 character elements, 6-shot storyboards, identity binding |
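
One way to keep characters consistent across shots is to model the storyboard locally before calling any API. The sketch below is purely illustrative and does not mirror Kling's request schema: `Shot`, `Storyboard`, and `add_shot` are hypothetical names, and the six-shot cap is an assumption taken from the "6-shot storyboards" note above.

```python
from dataclasses import dataclass, field

MAX_SHOTS = 6  # assumption, based on the "6-shot storyboards" note


@dataclass(frozen=True)
class Shot:
    prompt: str
    character_ids: tuple[str, ...] = ()


@dataclass
class Storyboard:
    # Maps a character id to its reference image path, shared by every shot.
    characters: dict[str, str] = field(default_factory=dict)
    shots: list[Shot] = field(default_factory=list)

    def add_shot(self, prompt: str, *character_ids: str) -> None:
        """Append a shot, rejecting undeclared characters and storyboards over the cap."""
        if len(self.shots) >= MAX_SHOTS:
            raise ValueError(f"a storyboard holds at most {MAX_SHOTS} shots")
        unknown = [c for c in character_ids if c not in self.characters]
        if unknown:
            raise ValueError(f"unknown character ids: {unknown}")
        self.shots.append(Shot(prompt, tuple(character_ids)))
```

Declaring characters once and referencing them by id in every shot means each generated clip is tied to the same reference image, rather than re-describing the character in each prompt.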

Key Decisions

| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
| Character-consistent video | Kling 3.0 (Character Elements 3.0) |
| Narrative video / storytelling | Sora 2 (best cause-and-effect coherence) |
| Cinematic B-roll | Veo 3.1 (camera control + polished motion) |
| Professional VFX | Runway Gen-4.5 (Act-Two motion transfer) |
| High-volume social video | Kling 3.0 Standard ($0.20/video) |
| Open-source video gen | Wan 2.6 or LTX-2 |
| Lip-sync / avatar video | Kling 3.0 (native lip-sync API) |

Example

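```python
import base64

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# Base64-encode the image bytes for an inline image content block.
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # always cap the response length on vision requests
    messages=[{"role": "user", "content": [
        # Image block first, then the text prompt that refers to it.
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)

# The reply is a list of content blocks; the first one holds the text.
print(response.content[0].text)
```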

Common Mistakes

  1. Not setting max_tokens on vision requests (responses truncated)
  2. Sending oversized images without resizing (>2048px)
  3. Using high detail level for simple yes/no classification
  4. Using STT+LLM+TTS pipeline instead of native speech-to-speech
  5. Not leveraging barge-in support for natural voice conversations
  6. Using deprecated models (GPT-4V, Whisper-1)
  7. Ignoring rate limits on vision and audio endpoints
  8. Calling video generation APIs synchronously (they're async — poll or use callbacks)
  9. Generating separate clips without character elements (characters look different each time)
  10. Using Sora for high-volume social content (expensive, slow — use Kling Standard instead)
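
For mistake 7, a common remedy is to retry rate-limited calls with exponential backoff and jitter. The sketch below is a generic illustration: `RateLimited` is a hypothetical stand-in for your SDK's own rate-limit exception (for example, the Anthropic Python SDK raises `anthropic.RateLimitError`), and the delay values are arbitrary defaults rather than provider recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit error (HTTP 429)."""


def call_with_backoff(
    fn: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            # Out of attempts: surface the original error to the caller.
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Full jitter spreads retries out so concurrent clients do not sync up.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Some SDKs already retry rate-limited requests internally, so check the client's retry settings before layering a wrapper like this on top.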

Related Skills

  • ork:rag-retrieval - Multimodal RAG with image + text retrieval
  • ork:llm-integration - General LLM function calling patterns
  • streaming-api-patterns - WebSocket patterns for real-time audio
  • ork:demo-producer - Terminal demo videos (VHS, asciinema) — not AI video gen
