Agent skill
Workers AI
This skill should be used when the user asks about "Workers AI", "AI models", "text generation", "embeddings", "semantic search", "RAG", "Retrieval Augmented Generation", "AI inference", "LLaMA", "Llama", "bge embeddings", "@cf/ models", "AI Gateway", or discusses implementing AI features, choosing AI models, generating embeddings, or building RAG systems on Cloudflare Workers.
Install this agent skill in your project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/product/workers-ai
SKILL.md
Workers AI
Purpose
This skill provides comprehensive guidance for using Workers AI, Cloudflare's AI inference platform. It covers available models, inference patterns, embedding generation, RAG (Retrieval Augmented Generation) architectures, AI Gateway integration, and best practices for AI workloads. Use this skill when implementing AI features, selecting models, building RAG systems, or optimizing AI inference on Workers.
Workers AI Overview
Workers AI provides serverless AI inference at the edge with:
- Text Generation: LLMs for chat, completion, summarization
- Embeddings: Vector representations for semantic search
- Image Generation: Text-to-image models
- Vision: Image classification and object detection
- Speech: Text-to-speech and automatic speech recognition
- Translation: Language translation models
Key Benefits
- Edge deployment: Low latency inference globally
- No infrastructure: Serverless, auto-scaling
- Integrated: Native integration with Workers, Vectorize, D1
- Cost-effective: Pay per inference, no minimum
- Latest models: Llama 3.1, Mistral, BAAI embeddings
Model Categories
Text Generation Models
Llama 3.1 (Recommended):
@cf/meta/llama-3.1-8b-instruct
- Chat and instruction following
- Best for: Conversational AI, Q&A, summarization, general text generation
- Context window: 128K tokens
- Multilingual support
Mistral:
@cf/mistral/mistral-7b-instruct-v0.2
- Fast instruction following
- Best for: Quick responses, simpler tasks
- Context window: 32K tokens
Qwen:
@cf/qwen/qwen1.5-14b-chat-awq
- Quantized (AWQ) for efficiency
- Best for: Balance between speed and quality
See references/workers-ai-models.md for complete model catalog with specifications and use cases.
Embedding Models
BGE Base (Recommended for English):
@cf/baai/bge-base-en-v1.5
- High-quality English embeddings
- Dimensions: 768
- Best for: RAG, semantic search, English content
BGE Large (Higher Quality):
@cf/baai/bge-large-en-v1.5
- Higher quality, more compute
- Dimensions: 1024
- Best for: When quality is critical
BGE Small (Faster):
@cf/baai/bge-small-en-v1.5
- Faster, smaller model
- Dimensions: 384
- Best for: When speed is critical, large volumes
Multilingual:
@cf/baai/bge-m3
- Multilingual support
- Best for: Multi-language content
Image Generation
Stable Diffusion:
@cf/stabilityai/stable-diffusion-xl-base-1.0
- Text-to-image generation
@cf/bytedance/stable-diffusion-xl-lightning
- Faster generation
- Best for: Creating images from text descriptions
Vision Models
Image Classification:
@cf/microsoft/resnet-50
- Object recognition
- Best for: Classifying image content
Speech Models
Automatic Speech Recognition:
@cf/openai/whisper
- Speech-to-text
- Best for: Transcribing audio
Translation Models
Machine Translation:
@cf/meta/m2m100-1.2b
- Multilingual text-to-text translation
- Best for: Translating content between languages
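The Whisper model above takes the raw audio as an array of byte values. A minimal transcription sketch; the upload handling is an assumption for illustration:
// Minimal sketch: transcribe an uploaded audio file with Whisper
export default {
  async fetch(request, env) {
    // Assumes the request body is the raw audio (e.g., an MP3 upload)
    const audioBuffer = await request.arrayBuffer();
    const result = await env.AI.run('@cf/openai/whisper', {
      audio: [...new Uint8Array(audioBuffer)]
    });
    return Response.json({ transcript: result.text });
  }
};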
Text Generation
Basic Inference
export default {
async fetch(request, env, ctx) {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is Cloudflare Workers?' }
]
});
return new Response(JSON.stringify(response));
}
};
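Later examples read the generated text from the response field of the result object; a minimal sketch that returns just that text:
// The generated text lives on the result's response field
const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Say hello' }]
});
return new Response(result.response, {
  headers: { 'Content-Type': 'text/plain' }
});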
Streaming Responses
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'user', content: 'Write a story about...' }
],
stream: true
});
return new Response(stream, {
headers: { 'Content-Type': 'text/event-stream' }
});
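The streamed body is encoded as server-sent events. A sketch of consuming it from a browser client; the /api/generate endpoint name is hypothetical:
// Hypothetical client: read the SSE stream chunk by chunk
const res = await fetch('/api/generate');
const reader = res.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Each chunk holds one or more "data: {...}" SSE lines
  console.log(decoder.decode(value));
}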
Model Parameters
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [/* messages */],
max_tokens: 512, // Max tokens to generate
temperature: 0.7, // Creativity (0-1, higher = more random)
top_p: 0.9, // Nucleus sampling
top_k: 40, // Top-k sampling
repetition_penalty: 1.2 // Penalize repetition
});
Parameter guidelines:
- temperature: 0.1-0.3 for factual, 0.7-0.9 for creative
- max_tokens: Set based on expected response length
- top_p/top_k: Usually leave at defaults unless fine-tuning behavior
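As a quick illustration of these guidelines, two hypothetical parameter presets spread into the call:
// Hypothetical presets following the guidelines above
const factualParams = { temperature: 0.2, max_tokens: 256 };   // extraction, Q&A
const creativeParams = { temperature: 0.8, max_tokens: 1024 }; // stories, copywriting
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Summarize the quarterly report.' }],
  ...factualParams
});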
Embeddings
Generating Embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: ['Hello world', 'Another sentence']
}) as { data: number[][] };
const vector1 = embeddings.data[0]; // [0.123, -0.456, ...]
const vector2 = embeddings.data[1];
Important TypeScript note: always add the as { data: number[][] } type assertion when using the embeddings API, so the result is typed as an array of vectors.
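When Vectorize is not involved, the returned vectors can be compared directly. A minimal cosine-similarity sketch using vector1 and vector2 from the example above:
// Cosine similarity between two embedding vectors (1 = same direction)
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
const similarity = cosineSimilarity(vector1, vector2);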
Batch Processing
// Batch multiple texts for efficiency
const texts = documents.map(d => d.content);
// Process in batches of 100 (recommended batch size)
const batchSize = 100;
const allEmbeddings = [];
for (let i = 0; i < texts.length; i += batchSize) {
const batch = texts.slice(i, i + batchSize);
const result = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: batch
}) as { data: number[][] };
allEmbeddings.push(...result.data);
}
Text Chunking for Embeddings
For long documents, split into chunks before embedding:
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 500, // Characters per chunk
chunkOverlap: 50 // Overlap between chunks
});
const chunks = await splitter.splitText(longDocument);
// Generate embedding for each chunk
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: chunks
}) as { data: number[][] };
// Store each chunk with its embedding
for (let i = 0; i < chunks.length; i++) {
await env.VECTOR_INDEX.insert([{
id: `${docId}-chunk-${i}`,
values: embeddings.data[i],
metadata: { text: chunks[i], docId, chunkIndex: i }
}]);
}
See references/rag-architecture-patterns.md for complete RAG implementation patterns.
RAG (Retrieval Augmented Generation)
Basic RAG Pattern
async function answerQuestion(question, env) {
// 1. Generate question embedding
const questionEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [question]
}) as { data: number[][] };
// 2. Find similar documents
const similar = await env.VECTOR_INDEX.query(questionEmbedding.data[0], {
topK: 3,
returnMetadata: true
});
// 3. Build context from retrieved documents
const context = similar.matches
.map(match => match.metadata.text)
.join('\n\n');
// 4. Generate answer with context
const answer = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{
role: 'system',
content: 'Answer the question using only the provided context. If the answer is not in the context, say "I don\'t have enough information."'
},
{
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${question}`
}
]
});
return {
answer: answer.response,
sources: similar.matches.map(m => ({
score: m.score,
text: m.metadata.text
}))
};
}
Advanced RAG with Reranking
async function advancedRAG(question, env) {
// 1. Retrieve more candidates (top 10)
const questionEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
text: [question]
}) as { data: number[][] };
const candidates = await env.VECTOR_INDEX.query(questionEmbedding.data[0], {
topK: 10,
returnMetadata: true
});
// 2. Rerank with LLM for relevance
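// Note: scoring candidates one at a time keeps the example simple;
// Promise.all could run the relevance calls in parallel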
const reranked = [];
for (const candidate of candidates.matches) {
const relevance = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'user',
content: `Rate the relevance of this passage to the question on a scale of 0-10:\n\nQuestion: ${question}\n\nPassage: ${candidate.metadata.text}\n\nRating (just the number):`
}],
max_tokens: 5
});
const score = parseInt(relevance.response.trim(), 10);
if (score >= 7) {
reranked.push({ ...candidate, rerankScore: score });
}
}
// 3. Use top reranked results
reranked.sort((a, b) => b.rerankScore - a.rerankScore);
const topResults = reranked.slice(0, 3);
const context = topResults.map(r => r.metadata.text).join('\n\n');
// 4. Generate answer
const answer = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'system',
content: 'Answer based on the context provided.'
}, {
role: 'user',
content: `Context:\n${context}\n\nQuestion: ${question}`
}]
});
return { answer: answer.response, sources: topResults };
}
See examples/rag-implementation.js for complete RAG examples.
AI Gateway
AI Gateway provides caching, rate limiting, and analytics for AI requests.
Configuration
// wrangler.jsonc: bind Workers AI as usual
{
  "ai": {
    "binding": "AI"
  }
}
The gateway itself is selected per request in the options argument of env.AI.run() (see Usage below).
Benefits
- Caching: Cache identical requests, reduce costs
- Rate limiting: Protect against abuse
- Analytics: Track usage, costs, latency
- Fallback: Automatic retry and fallback logic
Usage
// Requests automatically go through AI Gateway when configured
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'Hello' }]
});
// Gateway handles caching, rate limiting, analytics automatically
Best Practices
Model Selection
- Text Generation:
  - Simple tasks: mistral-7b-instruct
  - Complex tasks: llama-3.1-8b-instruct
  - Long context: llama-3.1-8b-instruct (128K context)
- Embeddings:
  - English: bge-base-en-v1.5
  - Multilingual: bge-m3
  - Speed critical: bge-small-en-v1.5
  - Quality critical: bge-large-en-v1.5
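One way to encode this guidance is a small helper; the task labels are hypothetical, while the model IDs come from the catalog above:
// Hypothetical helper mapping task types to the models recommended above
function pickModel(task) {
  switch (task) {
    case 'simple-text': return '@cf/mistral/mistral-7b-instruct-v0.2';
    case 'complex-text':
    case 'long-context': return '@cf/meta/llama-3.1-8b-instruct';
    case 'embed-english': return '@cf/baai/bge-base-en-v1.5';
    case 'embed-multilingual': return '@cf/baai/bge-m3';
    case 'embed-fast': return '@cf/baai/bge-small-en-v1.5';
    case 'embed-quality': return '@cf/baai/bge-large-en-v1.5';
    default: return '@cf/meta/llama-3.1-8b-instruct';
  }
}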
Prompt Engineering
Good prompts:
// Be specific
{ role: 'user', content: 'Summarize this article in 3 bullet points: ...' }
// Provide context
{ role: 'system', content: 'You are an expert programmer.' }
// Use examples (few-shot)
{
role: 'user',
content: 'Example: Input "hello" -> Output "HELLO"\nInput "world" ->'
}
Avoid:
- Vague instructions
- Very long prompts without structure
- Asking for multiple unrelated tasks in one request
Cost Optimization
-
Cache results: Use KV to cache AI responses
javascriptconst cacheKey = `ai:${hash(prompt)}`; let cached = await env.CACHE.get(cacheKey); if (!cached) { cached = await env.AI.run(model, params); await env.CACHE.put(cacheKey, JSON.stringify(cached), { expirationTtl: 3600 }); } -
Use AI Gateway: Automatic caching and rate limiting
-
Batch embeddings: Process multiple texts together
-
Right-size models: Use smaller models when possible
-
Optimize prompts: Shorter prompts = lower cost
Performance Optimization
-
Streaming: Use streaming for long responses to improve perceived latency
-
Parallel requests: Use
Promise.all()for independent AI callsjavascriptconst [summary, sentiment] = await Promise.all([ env.AI.run(model, { messages: [summaryPrompt] }), env.AI.run(model, { messages: [sentimentPrompt] }) ]); -
Early termination: Use
max_tokensto limit output -
Async with waitUntil: For non-critical AI tasks
javascriptctx.waitUntil( generateAnalytics(request, env) );
RAG Best Practices
- Chunk size: 300-500 characters for optimal retrieval
- Overlap: 10-20% overlap between chunks to preserve context
- Top-K selection: 3-5 documents usually optimal
- Reranking: Consider LLM-based reranking for better quality
- Metadata: Store source information for citation
- Hybrid search: Combine vector search with keyword search for best results (see the sketch below)
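A minimal hybrid-search sketch. It assumes a D1 binding DB with a docs(id, text) table; the LIKE query is a crude stand-in for real keyword search, and the merge weights are arbitrary:
// Hypothetical hybrid retrieval: merge Vectorize hits with keyword hits from D1
async function hybridSearch(query, env) {
  // Vector leg: embed the query and search Vectorize
  const emb = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [query] });
  const vectorHits = await env.VECTOR_INDEX.query(emb.data[0], {
    topK: 10,
    returnMetadata: true
  });
  // Keyword leg: assumes a D1 table docs(id, text)
  const keywordHits = await env.DB.prepare(
    'SELECT id FROM docs WHERE text LIKE ? LIMIT 10'
  ).bind(`%${query}%`).all();
  // Naive merge: sum scores, with an arbitrary 0.5 boost for keyword matches
  const scores = new Map();
  for (const m of vectorHits.matches) {
    scores.set(m.id, (scores.get(m.id) || 0) + m.score);
  }
  for (const r of keywordHits.results) {
    scores.set(r.id, (scores.get(r.id) || 0) + 0.5);
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, 5)
    .map(([id, score]) => ({ id, score }));
}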
Pricing and Limits
Pricing Model
- Charged per neuron (unit of inference)
- Varies by model complexity
- Free tier available
- AI Gateway caching reduces costs
Rate Limits
- Model-specific rate limits
- Scale based on account type
- Use AI Gateway for automatic rate limiting
Quotas
- Free tier: Limited neurons/month
- Paid tier: Higher limits, pay as you go
- Enterprise: Custom quotas
See Cloudflare documentation or use cloudflare-docs-specialist agent for current pricing.
Common Patterns
Pattern 1: Conversational AI
// Maintain conversation history
const history = await env.KV.get(`chat:${sessionId}`, 'json') || [];
history.push({ role: 'user', content: userMessage });
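// In production, trim old turns so the history stays within the model's context window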
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: history
});
history.push({ role: 'assistant', content: response.response });
await env.KV.put(`chat:${sessionId}`, JSON.stringify(history), {
expirationTtl: 3600
});
Pattern 2: Document Analysis
// Analyze document with AI
const analysis = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'user',
content: `Analyze this document and extract:\n1. Main topics\n2. Key entities\n3. Sentiment\n\nDocument: ${documentText}`
}]
});
Pattern 3: Content Generation
// Generate content with specific format
const blogPost = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'system',
content: 'You are a professional blog writer.'
}, {
role: 'user',
content: `Write a blog post about ${topic}. Format:\n# Title\n## Introduction\n## Main Points\n## Conclusion`
}],
temperature: 0.8 // Higher creativity for content generation
});
Pattern 4: Data Extraction
// Extract structured data from unstructured text
const extracted = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{
role: 'user',
content: `Extract the following from this email and return as JSON:\n- Name\n- Email\n- Company\n- Message\n\nEmail: ${emailText}\n\nJSON:`
}],
temperature: 0.1 // Low temperature for factual extraction
});
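// Note: the model may wrap the JSON in extra text; parse defensively (e.g., try/catch) in production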
const data = JSON.parse(extracted.response);
Troubleshooting
Issue: "Model not found"
- Solution: Check the model name; ensure it starts with @cf/
Issue: "Rate limit exceeded"
- Solution: Use AI Gateway, implement caching, batch requests
Issue: "Embeddings dimension mismatch"
- Solution: Ensure Vectorize index dimensions match embedding model (e.g., 768 for bge-base-en-v1.5)
Issue: "Timeout on long generation"
- Solution: Use streaming, reduce max_tokens, or split into smaller requests
Issue: "Poor RAG results"
- Solution: Improve chunking strategy, increase top-K, add reranking, refine prompts
Additional Resources
Reference Files
For detailed information, consult:
- references/workers-ai-models.md - Complete model catalog with specs and use cases
- references/rag-architecture-patterns.md - RAG implementation patterns and strategies
Example Files
Working examples in examples/:
- rag-implementation.js - Complete RAG system with Vectorize
- text-generation-examples.js - Various text generation patterns
Documentation Links
For the latest Workers AI documentation:
- Workers AI overview: https://developers.cloudflare.com/workers-ai/
- Models: https://developers.cloudflare.com/workers-ai/models/
- AI Gateway: https://developers.cloudflare.com/ai-gateway/
Use the cloudflare-docs-specialist agent to search AI documentation and the workers-ai-specialist agent for implementation guidance.