Agent skill

multi-modal

Multi-modal prompting with vision, audio, and document understanding

Install this agent skill in your project

npx add-skill https://github.com/pluginagentmarketplace/custom-plugin-prompt-engineering/tree/main/skills/multi-modal

SKILL.md

Multi-Modal Prompting Skill

Bonded to: advanced-techniques-agent


Quick Start

bash
Skill("custom-plugin-prompt-engineering:multi-modal")

Parameter Schema

yaml
parameters:
  modality:
    type: enum
    values: [vision, document, audio, video]
    required: true

  task_type:
    type: enum
    values: [analysis, extraction, generation, qa]
    default: analysis

  detail_level:
    type: enum
    values: [low, medium, high]
    default: medium
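
As a rough illustration of how a caller might enforce this schema before invoking the skill, the sketch below checks the three parameters against their enums and applies the defaults. The `SkillParams` class is hypothetical and not part of the skill; only the allowed values and defaults come from the schema above.

```python
from dataclasses import dataclass

# Allowed values copied from the parameter schema above.
MODALITIES = ("vision", "document", "audio", "video")
TASK_TYPES = ("analysis", "extraction", "generation", "qa")
DETAIL_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class SkillParams:
    """Hypothetical container for the three schema parameters."""

    modality: str                  # required, so no default
    task_type: str = "analysis"    # default from the schema
    detail_level: str = "medium"   # default from the schema

    def __post_init__(self):
        # Reject any value outside the enums declared in the schema.
        for name, allowed in (
            ("modality", MODALITIES),
            ("task_type", TASK_TYPES),
            ("detail_level", DETAIL_LEVELS),
        ):
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name} must be one of {allowed}, got {value!r}")


# Only the required parameter is given; the other two fall back to defaults.
params = SkillParams(modality="vision")
print(params.task_type, params.detail_level)
```

Validating up front means a typo such as `modality="image"` fails with a clear message instead of producing a prompt for an unsupported modality.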

Vision Prompting

Image Analysis Template

markdown
Analyze this image and provide:
1. Main subjects and objects
2. Actions or activities
3. Setting and context
4. Notable details
5. Overall interpretation

Be specific and descriptive.

Visual Q&A Pattern

markdown
Look at the image carefully.

Question: {question}

Provide a detailed answer based only on what you can see in the image.
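
The `{question}` placeholder in the pattern above can be filled with ordinary string formatting. This is a minimal sketch; the helper name `build_vqa_prompt` is an assumption made for illustration, and attaching the image to the request is left to whichever model API is in use.

```python
# Template text copied from the Visual Q&A pattern above.
VQA_TEMPLATE = (
    "Look at the image carefully.\n\n"
    "Question: {question}\n\n"
    "Provide a detailed answer based only on what you can see in the image."
)


def build_vqa_prompt(question: str) -> str:
    """Return the Visual Q&A prompt with the question filled in."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return VQA_TEMPLATE.format(question=question)


print(build_vqa_prompt("How many people are visible?"))
```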

Chart/Graph Analysis

markdown
Analyze this chart/graph:
1. Type of visualization
2. Axes and labels
3. Key data points
4. Trends or patterns
5. Main insights
6. Limitations or caveats

Document Processing

PDF Extraction

markdown
Extract the following from this document:
- Title and headers
- Key information: {fields}
- Tables (if any)
- Important dates/numbers

Output as structured JSON.
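
One way to use this template is to fill `{fields}` from a list and then check the model's JSON reply for the requested keys. The sketch below assumes the reply is a single JSON object keyed by field name, which the template itself does not guarantee; both helper names are hypothetical.

```python
import json


def build_extraction_prompt(fields):
    """Fill the {fields} slot of the PDF extraction template."""
    field_list = ", ".join(fields)
    return (
        "Extract the following from this document:\n"
        "- Title and headers\n"
        f"- Key information: {field_list}\n"
        "- Tables (if any)\n"
        "- Important dates/numbers\n\n"
        "Output as structured JSON."
    )


def parse_extraction_reply(reply, fields):
    """Parse the reply and list any requested fields that are absent.

    Assumes the reply is one JSON object keyed by field name.
    """
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [name for name in fields if name not in data]
    return data, missing


fields = ["author", "report_date"]
print(build_extraction_prompt(fields))

# A made-up reply, used only to exercise the parser.
sample_reply = '{"author": "J. Doe"}'
data, missing = parse_extraction_reply(sample_reply, fields)
print(missing)
```

Reporting missing fields separately, rather than raising, lets the caller decide whether to re-prompt for just those fields.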

Form/Invoice Processing

yaml
extraction_schema:
  document_type: "invoice|form|contract"
  fields:
    - name: vendor
      type: string
    - name: date
      type: date
    - name: total
      type: currency
    - name: line_items
      type: array
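
A record extracted under this schema can be checked field by field before it is used downstream. The sketch below is one possible validator, not part of the skill: it assumes dates arrive as ISO 8601 strings (`YYYY-MM-DD`) and currency values as numbers or plain decimal strings, neither of which the schema specifies.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Field names and types copied from the extraction schema above.
SCHEMA_FIELDS = [
    ("vendor", "string"),
    ("date", "date"),
    ("total", "currency"),
    ("line_items", "array"),
]


def _is_date(value):
    # Assumption: dates are ISO 8601 strings such as "2024-03-15".
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def _is_currency(value):
    # Assumption: amounts are numbers or plain decimal strings ("129.50").
    if isinstance(value, bool) or not isinstance(value, (int, float, str)):
        return False
    try:
        return Decimal(str(value)).is_finite()
    except InvalidOperation:
        return False


CHECKS = {
    "string": lambda v: isinstance(v, str),
    "date": _is_date,
    "currency": _is_currency,
    "array": lambda v: isinstance(v, list),
}


def validate_record(record):
    """Return a list of problems; an empty list means the record passed."""
    errors = []
    for name, type_name in SCHEMA_FIELDS:
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not CHECKS[type_name](record[name]):
            errors.append(f"field {name!r} is not a valid {type_name}")
    return errors


# A made-up invoice record, used only to exercise the validator.
record = {"vendor": "Acme", "date": "2024-03-15", "total": "129.50", "line_items": []}
print(validate_record(record))
```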

Audio Integration

Transcription Enhancement

markdown
Transcribe this audio and provide:
1. Accurate transcription
2. Speaker identification
3. Timestamps for key points
4. Summary of main topics
5. Action items (if applicable)

Best Practices

yaml
best_practices:
  image_prompts:
    - Be specific about what to look for
    - Request structured output
    - Ask for confidence levels

  document_prompts:
    - Define extraction schema
    - Handle multi-page documents
    - Validate extracted data

  audio_prompts:
    - Specify language if known
    - Request speaker diarization
    - Ask for timestamps

Troubleshooting

| Issue | Cause | Solution |
|-------|-------|----------|
| Hallucinated details | Over-interpretation | Ask for visible-only info |
| Missed text in images | Low resolution | Request higher detail |
| Wrong document parsing | Complex layout | Break into sections |
| Inaccurate transcription | Audio quality | Acknowledge limitations |

References

See: GPT-4V documentation, Claude Vision Guide
