Agent skill

groq-core-workflow-b

Execute Groq secondary workflows: audio transcription (Whisper), vision, text-to-speech, and batch model evaluation. Trigger with phrases like "groq whisper", "groq transcription", "groq audio", "groq vision", "groq TTS", "groq speech".

Stars 1,803
Forks 241

Install this agent skill to your Project

npx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/groq-pack/skills/groq-core-workflow-b

SKILL.md

Groq Core Workflow B: Audio, Vision & Speech

Overview

Beyond chat completions, Groq provides ultra-fast audio transcription (Whisper at 216x real-time), multimodal vision (Llama 4 Scout/Maverick), and text-to-speech. These endpoints use the same groq-sdk client.

Prerequisites

  • groq-sdk installed, GROQ_API_KEY set
  • For audio: audio files in supported formats
  • For vision: image URLs or base64 images

Audio Models

Model ID Languages Speed Best For
whisper-large-v3 100+ 164x real-time Best accuracy, multilingual
whisper-large-v3-turbo 100+ 216x real-time Best speed/accuracy balance

Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm

Instructions

Step 1: Audio Transcription (Whisper)

typescript
import Groq from "groq-sdk";
import fs from "fs";

const groq = new Groq();

// Transcribe audio file
async function transcribe(filePath: string): Promise<string> {
  const transcription = await groq.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3-turbo",
    response_format: "json",        // or "text" or "verbose_json"
    language: "en",                  // Optional: ISO 639-1 code
  });

  return transcription.text;
}

// With timestamps (verbose mode)
async function transcribeWithTimestamps(filePath: string) {
  const transcription = await groq.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3-turbo",
    response_format: "verbose_json",
    timestamp_granularities: ["segment"],
  });

  return transcription;
  // Returns segments with start/end times
}

Step 2: Audio Translation (to English)

typescript
// Translate any language audio to English text
async function translateAudio(filePath: string): Promise<string> {
  const translation = await groq.audio.translations.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3",
  });

  return translation.text;
}

Step 3: Vision (Image Understanding)

typescript
// Analyze images with Llama 4 Scout (up to 5 images per request)
async function analyzeImage(imageUrl: string, question: string) {
  const completion = await groq.chat.completions.create({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
    max_tokens: 1024,
  });

  return completion.choices[0].message.content;
}

// Multiple images
async function compareImages(urls: string[], prompt: string) {
  const imageContent = urls.map((url) => ({
    type: "image_url" as const,
    image_url: { url },
  }));

  const completion = await groq.chat.completions.create({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    messages: [{
      role: "user",
      content: [{ type: "text", text: prompt }, ...imageContent],
    }],
    max_tokens: 2048,
  });

  return completion.choices[0].message.content;
}

// Base64 image input
async function analyzeBase64Image(base64Data: string) {
  return groq.chat.completions.create({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Describe this image in detail." },
        {
          type: "image_url",
          image_url: { url: `data:image/jpeg;base64,${base64Data}` },
        },
      ],
    }],
  });
}

Step 4: Text-to-Speech

typescript
// Generate speech from text
async function textToSpeech(text: string, outputPath: string) {
  const response = await groq.audio.speech.create({
    model: "playai-tts",          // or "playai-tts-arabic"
    input: text,
    voice: "Arista-PlayAI",      // See Groq docs for voice options
    response_format: "wav",       // wav, mp3, flac, opus, aac
  });

  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(outputPath, buffer);
  console.log(`Audio saved to ${outputPath}`);
}

Step 5: Python Audio Transcription

python
from groq import Groq

client = Groq()

# Transcribe
with open("audio.mp3", "rb") as file:
    transcription = client.audio.transcriptions.create(
        file=("audio.mp3", file),
        model="whisper-large-v3-turbo",
        response_format="verbose_json",
    )
    print(transcription.text)
    for segment in transcription.segments:
        print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")

Step 6: Model Benchmarking

typescript
// Compare models on same prompt for speed vs quality
async function benchmarkModels(prompt: string) {
  const models = [
    "llama-3.1-8b-instant",
    "llama-3.3-70b-versatile",
    "llama-3.3-70b-specdec",
  ];

  for (const model of models) {
    const start = performance.now();
    const result = await groq.chat.completions.create({
      model,
      messages: [{ role: "user", content: prompt }],
      max_tokens: 200,
    });
    const elapsed = performance.now() - start;
    const tps = result.usage!.completion_tokens / ((result.usage as any).completion_time || 1);

    console.log(
      `${model.padEnd(45)} | ${elapsed.toFixed(0)}ms | ${tps.toFixed(0)} tok/s | ${result.usage!.total_tokens} tokens`
    );
  }
}

Vision Model Limits

  • Maximum 5 images per request
  • Supported formats: JPEG, PNG, GIF, WebP
  • Images fetched from URL or embedded as base64
  • Vision models also support tool use, JSON mode, and streaming

Error Handling

Error Cause Solution
Invalid file format Unsupported audio type Convert to mp3/wav/flac first
File too large Audio exceeds 25MB Split into smaller chunks
model_not_found Vision model ID wrong Use full path: meta-llama/llama-4-scout-17b-16e-instruct
max_images_exceeded >5 images in request Reduce to 5 or fewer images
429 on Whisper Audio RPM limit hit Queue transcription requests

Resources

Next Steps

For common errors and troubleshooting, see groq-common-errors.

Expand your agent's capabilities with these related and highly-rated skills.

Didn't find tool you were looking for?

Be as detailed as possible for better results