Agent skill

multimodal-llm

Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling, Sora, Veo, Runway), or building multimodal AI pipelines.

Stars 143

Forks 15

Install this agent skill to your Project

npx add-skill https://github.com/yonatangross/orchestkit/tree/main/src/skills/multimodal-llm

Metadata

Additional technical details for this skill

category: mcp-enhancement

SKILL.md

Multimodal LLM Patterns

Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling 3.0, Sora 2, Veo 3.1, Runway Gen-4.5).

Quick Reference

Category	Rules	Impact	When to Use
Vision: Image Analysis	1	HIGH	Image captioning, VQA, multi-image comparison, object detection
Vision: Document Understanding	1	HIGH	OCR, chart/diagram analysis, PDF processing, table extraction
Vision: Model Selection	1	MEDIUM	Choosing provider, cost optimization, image size limits
Audio: Speech-to-Text	1	HIGH	Transcription, speaker diarization, long-form audio
Audio: Text-to-Speech	1	MEDIUM	Voice synthesis, expressive TTS, multi-speaker dialogue
Audio: Model Selection	1	MEDIUM	Real-time voice agents, provider comparison, pricing
Video: Model Selection	1	HIGH	Choosing video gen provider (Kling, Sora, Veo, Runway)
Video: API Patterns	1	HIGH	Async task polling, SDK integration, webhook callbacks
Video: Multi-Shot	1	HIGH	Storyboarding, character elements, scene consistency

Total: 9 rules across 3 categories (Vision, Audio, Video Generation)

Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.

Rule	File	Key Pattern
Image Analysis	`rules/vision-image-analysis.md`	Base64 encoding, multi-image, bounding boxes
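
The resize-then-encode advice above can be sketched with the standard library alone. This is a minimal illustration, not the skill's own implementation: `fit_within` and `image_block` are hypothetical helper names, the 2048px ceiling is borrowed from the Common Mistakes section, and the actual pixel resize would need an imaging library such as Pillow, which is not shown here.

```python
import base64


def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Return (width, height) scaled down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    # Preserve the aspect ratio and keep each side at least 1 pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))


def image_block(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Build a base64 image content block in the shape used in the Example section."""
    data = base64.standard_b64encode(image_bytes).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }


print(fit_within(4096, 2048))  # (2048, 1024)
print(fit_within(800, 600))    # (800, 600), left unchanged
```

Computing the target size separately from the encoding step keeps the size policy easy to test without any image files.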

Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

Rule	File	Key Pattern
Document Vision	`rules/vision-document.md`	PDF page ranges, detail levels, OCR strategies
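
One way to work with PDF page ranges is to split a long document into fixed-size batches and send each batch as its own request. The sketch below only computes the ranges. `page_ranges` is a hypothetical helper, and the batch size of 10 is an arbitrary value chosen for illustration, not a provider limit.

```python
def page_ranges(total_pages: int, pages_per_request: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if pages_per_request < 1:
        raise ValueError("pages_per_request must be at least 1")
    return [
        (first, min(first + pages_per_request - 1, total_pages))
        for first in range(1, total_pages + 1, pages_per_request)
    ]


print(page_ranges(25, 10))  # [(1, 10), (11, 20), (21, 25)]
```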

Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

Rule	File	Key Pattern
Vision Models	`rules/vision-models.md`	Provider comparison, token costs, image limits

Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

Rule	File	Key Pattern
Speech-to-Text	`rules/audio-speech-to-text.md`	Gemini long-form, GPT-4o-Transcribe, AssemblyAI features
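
Diarized output usually arrives as many short segments, so a common post-processing step is to merge consecutive segments from the same speaker into one turn. The sketch below assumes a simple segment shape (speaker, start, end, text). The `Segment` class and both function names are hypothetical and do not correspond to any provider's response format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments spoken by the same speaker into one turn."""
    merged: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged and merged[-1].speaker == seg.speaker:
            last = merged[-1]
            # Extend the previous turn instead of starting a new one.
            merged[-1] = Segment(
                last.speaker, last.start, max(last.end, seg.end), f"{last.text} {seg.text}"
            )
        else:
            merged.append(seg)
    return merged


def format_transcript(segments: list[Segment]) -> str:
    """Render merged turns as one '[start] speaker: text' line per turn."""
    return "\n".join(
        f"[{turn.start:.1f}s] {turn.speaker}: {turn.text}" for turn in merge_turns(segments)
    )


example = [
    Segment("A", 0.0, 1.5, "Hello"),
    Segment("A", 1.5, 3.0, "there."),
    Segment("B", 3.2, 4.0, "Hi."),
]
print(format_transcript(example))
```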

Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

Rule	File	Key Pattern
Text-to-Speech	`rules/audio-text-to-speech.md`	Gemini TTS, voice config, auditory cues

Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

Rule	File	Key Pattern
Audio Models	`rules/audio-models.md`	Real-time voice comparison, STT benchmarks, pricing

Video: Model Selection

Choose the right video generation provider based on use case, duration, and budget.

Rule	File	Key Pattern
Video Models	`rules/video-generation-models.md`	Kling vs Sora vs Veo vs Runway, pricing, capabilities

Video: API Patterns

Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.

Rule	File	Key Pattern
API Integration	`rules/video-generation-patterns.md`	Kling REST, fal.ai SDK, Vercel AI SDK, task polling
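
The async polling pattern can be sketched independently of any provider. In the sketch below, `fetch_status` is a hypothetical callable that wraps whatever "get task" endpoint the provider exposes, and the status values `"succeeded"` and `"failed"` are assumptions. Real APIs use their own field names and states, so check the provider's documentation before adapting this.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    *,
    interval: float = 5.0,
    max_interval: float = 60.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a generation task until it finishes, backing off between checks."""
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_status()
        state = task.get("status")  # assumed field name
        if state == "succeeded":    # assumed terminal state
            return task
        if state == "failed":       # assumed terminal state
            raise RuntimeError(f"video task failed: {task.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("video task did not finish before the timeout")
        sleep(interval)
        # Double the wait after each check, up to max_interval.
        interval = min(interval * 2, max_interval)
```

Passing `sleep` as a parameter makes the loop testable without real waiting. Where the provider supports webhook callbacks, those avoid polling altogether.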

Video: Multi-Shot

Generate multi-scene videos with consistent characters using storyboarding and character elements.

Rule	File	Key Pattern
Multi-Shot	`rules/video-multi-shot.md`	Kling 3.0 character elements, 6-shot storyboards, identity binding
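
Before submitting a storyboard, it can help to check locally that every shot refers to a character that was actually registered. The sketch below is purely illustrative: `Shot`, `validate_storyboard`, and the character IDs are hypothetical, and treating 6 shots as an upper bound is an assumption drawn from the "6-shot storyboards" pattern above, not a documented API limit.

```python
from dataclasses import dataclass, field

MAX_SHOTS = 6  # assumption: taken from the "6-shot storyboards" pattern


@dataclass
class Shot:
    prompt: str
    character_ids: list[str] = field(default_factory=list)


def validate_storyboard(shots: list[Shot], known_characters: set[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the board looks OK."""
    problems: list[str] = []
    if not shots:
        problems.append("storyboard has no shots")
    if len(shots) > MAX_SHOTS:
        problems.append(f"storyboard has {len(shots)} shots, more than {MAX_SHOTS}")
    for number, shot in enumerate(shots, start=1):
        if not shot.prompt.strip():
            problems.append(f"shot {number} has an empty prompt")
        for character_id in shot.character_ids:
            if character_id not in known_characters:
                problems.append(f"shot {number} references unknown character {character_id!r}")
    return problems


board = [
    Shot("Wide shot of the harbor at dawn", ["captain"]),
    Shot("Close-up as she reads the map", ["captain", "navigator"]),
]
print(validate_storyboard(board, known_characters={"captain"}))
```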

Key Decisions

Decision	Recommendation
High accuracy vision	Claude Opus 4.6 or GPT-5
Long documents	Gemini 2.5 Pro (1M context)
Cost-efficient vision	Gemini 2.5 Flash ($0.15/M tokens)
Video analysis	Gemini 2.5/3 Pro (native video)
Voice assistant	Grok Voice Agent (fastest, <1s)
Emotional voice AI	Gemini Live API
Long audio transcription	Gemini 2.5 Pro (9.5hr)
Speaker diarization	AssemblyAI or Gemini
Self-hosted STT	Whisper Large V3
Character-consistent video	Kling 3.0 (Character Elements 3.0)
Narrative video / storytelling	Sora 2 (best cause-and-effect coherence)
Cinematic B-roll	Veo 3.1 (camera control + polished motion)
Professional VFX	Runway Gen-4.5 (Act-Two motion transfer)
High-volume social video	Kling 3.0 Standard ($0.20/video)
Open-source video gen	Wan 2.6 or LTX-2
Lip-sync / avatar video	Kling 3.0 (native lip-sync API)
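
As a rough worked example of the two prices quoted in the table, the sketch below estimates spend for a given volume. It assumes the $0.15/M figure is a flat per-token rate and the $0.20 figure is a flat per-video rate. Actual billing may distinguish input from output tokens, or vary by duration and resolution, so treat the results as order-of-magnitude estimates. The function names are hypothetical.

```python
from decimal import Decimal


def token_cost(tokens: int, usd_per_million: Decimal) -> Decimal:
    """Cost in USD for a token count at a flat per-million-token rate."""
    return Decimal(tokens) / Decimal(1_000_000) * usd_per_million


def video_cost(videos: int, usd_per_video: Decimal) -> Decimal:
    """Cost in USD for a number of videos at a flat per-video rate."""
    return Decimal(videos) * usd_per_video


# 2.5M tokens at the quoted Gemini 2.5 Flash rate.
print(token_cost(2_500_000, Decimal("0.15")))  # 0.375
# 500 clips at the quoted Kling 3.0 Standard rate.
print(video_cost(500, Decimal("0.20")))        # 100.00
```

`Decimal` is used instead of floats so that currency amounts do not pick up binary rounding noise.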

Example

python

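import base64

import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

# Read the image and base64-encode it for the request body.
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # always set an explicit output limit on vision requests
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)

# The reply is a list of content blocks; the first one holds the text.
print(response.content[0].text)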

Common Mistakes

Not setting max_tokens on vision requests (responses may be truncated)
Sending oversized images without resizing (>2048px)
Using high detail level for simple yes/no classification
Using STT+LLM+TTS pipeline instead of native speech-to-speech
Not leveraging barge-in support for natural voice conversations
Using deprecated models (GPT-4V, Whisper-1)
Ignoring rate limits on vision and audio endpoints
Calling video generation APIs synchronously (they're async — poll or use callbacks)
Generating separate clips without character elements (characters look different each time)
Using Sora for high-volume social content (expensive, slow — use Kling Standard instead)
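
For the rate-limit item in the list above, a common remedy is to retry with exponential backoff and jitter. The sketch below is generic: `RateLimited` is a hypothetical stand-in for whatever exception a given SDK raises on an HTTP 429 response, and the delay values are arbitrary defaults rather than provider recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def call_with_backoff(
    call: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Run call(), retrying on RateLimited with exponentially growing waits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Wait 1x, 2x, 4x ... the base delay, plus up to 1s of jitter.
            sleep(base_delay * 2 ** attempt + jitter())
    raise AssertionError("unreachable")
```

The jitter term spreads retries from concurrent workers so they do not all hit the endpoint again at the same moment.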

Related Skills

ork:rag-retrieval - Multimodal RAG with image + text retrieval
ork:llm-integration - General LLM function calling patterns
streaming-api-patterns - WebSocket patterns for real-time audio
ork:demo-producer - Terminal demo videos (VHS, asciinema) — not AI video gen

Maintainer

yonatangross Core maintainer

Source details

Full Name: yonatangross/orchestkit
Branch: main
Path in repo: src/skills/multimodal-llm
License: MIT License
Topics: claude-code mcp typescript agents llm react ai-development security rag langgraph testing claude-plugin fastapi
