Agent skill
ai-llm-engineering
Operational skill hub for LLM system architecture, evaluation, deployment, and optimization (modern production standards). Links to specialized skills for prompts, RAG, agents, and safety. Integrates recent advances: PEFT/LoRA fine-tuning, hybrid RAG handoff (see dedicated skill), vLLM serving (up to 24x throughput), multi-layered security (single-layer defenses see 90%+ bypass rates), automated drift detection (18-second response), and CI/CD-aligned evaluation.
Install this agent skill to your Project
npx add-skill https://github.com/Microck/ordinary-claude-skills/tree/main/skills_categorized/arts-crafts/ai-llm-engineering
LLM Engineering – Operational Skill Hub
A single resource for executing, validating, and scaling LLM systems with modern production standards, while delegating domain depth to specialized skills.
This skill provides quick reference, decision frameworks, and navigation to detailed operational patterns for:
- Data, training, fine-tuning (PEFT/LoRA standard)
- Evaluation (automated testing, metrics, rollout gates)
- Deployment (vLLM serving with up to 24x throughput, FP8/FP4 quantization)
- LLMOps (automated drift detection, retraining)
- Safety (multi-layered defenses, AI-powered guardrails)
For detailed patterns: See Resources and Templates sections below.
Quick Reference
| Task | Tool/Framework | Command/Pattern | When to Use |
|---|---|---|---|
| RAG Pipeline | LlamaIndex, LangChain | Page-level chunking + hybrid retrieval | Dynamic knowledge, 0.648 accuracy |
| Agentic Workflow | LangGraph, AutoGen, CrewAI | ReAct, multi-agent orchestration | Complex tasks, tool use required |
| Prompt Design | Anthropic, OpenAI guides | CoT, few-shot, structured | Task-specific behavior control |
| Evaluation | LangSmith, W&B, RAGAS | Multi-metric (hallucination, bias, cost) | Quality validation, A/B testing |
| Production Deploy | vLLM, TensorRT-LLM | FP8/FP4 quantization, up to 24x throughput | High-throughput serving, cost optimization |
| Monitoring | Arize Phoenix, LangFuse | Drift detection, 18-second response | Production LLM systems |
Decision Tree: LLM System Architecture
Building LLM application: [Architecture Selection]
├─ Need current knowledge?
│ ├─ Simple Q&A? → Basic RAG (page-level chunking + hybrid retrieval)
│ └─ Complex retrieval? → Advanced RAG (reranking + contextual retrieval)
│
├─ Need tool use / actions?
│ ├─ Single task? → Simple agent (ReAct pattern)
│ └─ Multi-step workflow? → Multi-agent (LangGraph, CrewAI)
│
├─ Static behavior sufficient?
│ ├─ Quick MVP? → Prompt engineering (CI/CD integrated)
│ └─ Production quality? → Fine-tuning (PEFT/LoRA)
│
└─ Best results?
   └─ Hybrid (RAG + Fine-tuning + Agents) → Comprehensive solution
See Decision Matrices for detailed selection criteria.
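As a rough illustration, the decision tree can be read as a small selection function. This is a minimal sketch, not part of the skill: the `Requirements` fields and the `select_architecture` name are hypothetical, and the returned labels simply echo the tree's leaves. Returning more than one component corresponds to the hybrid branch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical inputs mirroring the questions in the decision tree."""
    needs_current_knowledge: bool = False
    complex_retrieval: bool = False
    needs_tool_use: bool = False
    multi_step_workflow: bool = False
    production_quality: bool = False


def select_architecture(req: Requirements) -> list[str]:
    """Return the components suggested by the tree, in the order they are checked."""
    components = []
    if req.needs_current_knowledge:
        components.append("advanced RAG" if req.complex_retrieval else "basic RAG")
    if req.needs_tool_use:
        components.append("multi-agent" if req.multi_step_workflow else "simple ReAct agent")
    if not components:
        # Static behavior is sufficient: choose between prompting and fine-tuning.
        components.append("PEFT/LoRA fine-tuning" if req.production_quality else "prompt engineering")
    return components


print(select_architecture(Requirements(needs_current_knowledge=True, needs_tool_use=True)))
```

A real selection would also weigh cost, latency, and data availability, which the Decision Matrices cover in more detail.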
When to Use This Skill
Claude should invoke this skill when the user asks about:
- LLM preflight/project checklists, production best practices, or data pipelines
- Building or deploying RAG, agentic, or prompt-based LLM apps
- Prompt design, chain-of-thought (CoT), ReAct, or template patterns
- Troubleshooting LLM hallucination, bias, retrieval issues, or production failures
- Evaluating LLMs: benchmarks, multi-metric eval, or rollout/monitoring
- LLMOps: deployment, rollback, scaling, resource optimization
- Technology stack selection (models, vector DBs, frameworks)
- Production deployment strategies and operational patterns
Scope Boundaries (Use These Skills for Depth)
- Prompt design & CI/CD → ai-prompt-engineering
- RAG pipelines & chunking → ai-llm-rag-engineering
- Search tuning (BM25, HNSW, hybrid) → ai-llm-search-retrieval
- Agent architectures & tools → ai-agents-development
- Serving optimization/quantization → ai-llm-ops-inference
- Production deployment/monitoring → ai-ml-ops-production
- Security/guardrails → ai-ml-ops-security
Resources (Best Practices & Operational Patterns)
Comprehensive operational guides with checklists, patterns, and decision frameworks:
Core Operational Patterns
- Project Planning Patterns - Stack selection, FTI pipeline, performance budgeting
  - AI engineering stack selection matrix
  - Feature/Training/Inference (FTI) pipeline blueprint
  - Performance budgeting and goodput gates
  - Progressive complexity (prompt → RAG → fine-tune → hybrid)
- Production Checklists - Pre-deployment validation and operational checklists
  - LLM lifecycle checklist (modern production standards)
  - Data & training, RAG pipeline, deployment & serving
  - Safety/guardrails, evaluation, agentic systems
  - Reliability & data infrastructure (DDIA-grade)
  - Weekly production tasks
- Common Design Patterns - Copy-paste ready implementation examples
  - Chain-of-Thought (CoT) prompting
  - ReAct (Reason + Act) pattern
  - RAG pipeline (minimal to advanced)
  - Agentic planning loop
  - Self-reflection and multi-agent collaboration
- Decision Matrices - Quick reference tables for selection
  - RAG type decision matrix (naive → advanced → modular)
  - Production evaluation table with targets and actions
  - Model selection matrix (GPT-4, Claude, Gemini, self-hosted)
  - Vector database, embedding model, framework selection
  - Deployment strategy matrix
- Anti-Patterns - Common mistakes and prevention strategies
  - Data leakage, prompt dilution, RAG context overload
  - Agentic runaway, over-engineering, ignoring evaluation
  - Hard-coded prompts, missing observability
  - Detection methods and prevention code examples
Domain-Specific Patterns
- LLMOps Best Practices - Operational lifecycle and deployment patterns
- Evaluation Patterns - Testing, metrics, and quality validation
- Prompt Engineering Patterns - Quick reference (canonical skill: ai-prompt-engineering)
- Agentic Patterns - Quick reference (canonical skill: ai-agents-development)
- RAG Best Practices - Quick reference (canonical skill: ai-llm-rag-engineering)
Note: Each resource file includes preflight/validation checklists, copy-paste reference tables, inline templates, anti-patterns, and decision matrices.
Templates (Copy-Paste Ready)
Production templates by use case and technology:
RAG Pipelines
- Basic RAG - Simple retrieval-augmented generation
- Advanced RAG - Hybrid retrieval, reranking, contextual embeddings
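Hybrid retrieval needs some way to merge keyword and vector results into one ranking. The templates are not reproduced here, so the sketch below assumes reciprocal rank fusion (RRF), one common fusion method; the document IDs and the `reciprocal_rank_fusion` name are illustrative only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by doc ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword results
vector_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical embedding results
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only ranks, so BM25 and cosine-similarity scores never need to be put on a common scale. A reranker, if used, would then run over the fused list.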
Prompt Engineering
- Chain-of-Thought - Step-by-step reasoning pattern
- ReAct - Reason + Act for tool use
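To make the ReAct pattern concrete, here is a minimal sketch of the reason/act loop. It is not the referenced template: the `Action: name[arg]` and `Final Answer:` text conventions, the `react_loop` function, and the scripted stand-in for a model call are all assumptions chosen for illustration. The `max_steps` budget is a simple guard against a runaway agent.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react_loop(llm, tools, question, max_steps=5):
    """Alternate model output and tool calls until a final answer or the step budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(transcript)
        transcript += output + "\n"
        final = FINAL_RE.search(output)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(output)
        if action is None:
            transcript += "Observation: no valid action found\n"
            continue
        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted


def scripted_llm(transcript):
    """Stand-in for a real model call: replies based on whether an observation exists yet."""
    if "Observation:" not in transcript:
        return "Thought: I need to add the numbers.\nAction: add[2, 3]"
    return "Thought: I have the result.\nFinal Answer: 5"


def add(arg):
    """Toy tool: sum a comma-separated list of integers."""
    return str(sum(int(x) for x in arg.split(",")))


print(react_loop(scripted_llm, {"add": add}, "What is 2 + 3?"))  # → 5
```

In practice `llm` would wrap a real model API and the transcript format would follow whatever prompt template is in use.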
Agentic Workflows
- Reflection Agent - Self-critique and improvement
- Multi-Agent - Manager-worker orchestration
Data Pipelines
- Data Quality - Validation, deduplication, PII detection
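The three data-quality steps named above (validation, deduplication, PII detection) can be sketched in a few lines. This is an illustrative assumption rather than the actual template: `clean_records` is a hypothetical helper, the length check stands in for real validation rules, and the email regex covers only one narrow kind of PII.

```python
import hashlib
import re

# Deliberately simple pattern; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_records(texts, min_chars=10):
    """Drop short and duplicate texts, and flag the ones containing an email address."""
    seen = set()
    kept = []
    for text in texts:
        normalized = " ".join(text.split()).lower()
        if len(normalized) < min_chars:
            continue  # validation: too short to be useful
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after whitespace/case normalization
        seen.add(digest)
        kept.append({"text": text, "has_email": bool(EMAIL_RE.search(text))})
    return kept


rows = clean_records([
    "Contact me at jane@example.com today",
    "contact me at  JANE@example.com today",  # duplicate after normalization
    "hi",                                      # too short
    "A clean sentence here.",
])
print(len(rows), [r["has_email"] for r in rows])  # → 2 [True, False]
```

Hash-based deduplication only catches exact matches after normalization; near-duplicate detection would need a different technique such as MinHash.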
Deployment
- LLM Deployment - Production deployment with monitoring
Evaluation
- Multi-Metric Evaluation - Comprehensive testing suite
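A multi-metric evaluation usually ends in a pass/fail rollout decision. The sketch below shows one possible shape for such a gate; the function name, the metric names, and the threshold values are hypothetical and not taken from the template.

```python
def rollout_gate(metrics, thresholds):
    """Return (passed, failures) for a dict of metrics against per-metric limits.

    thresholds maps metric name -> ("min" | "max", limit).
    A missing metric counts as a failure so gaps cannot pass silently.
    """
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return (not failures, failures)


# Hypothetical thresholds; real targets depend on the application.
thresholds = {
    "answer_accuracy": ("min", 0.85),
    "hallucination_rate": ("max", 0.05),
    "cost_per_request_usd": ("max", 0.02),
}
candidate = {"answer_accuracy": 0.9, "hallucination_rate": 0.08, "cost_per_request_usd": 0.01}
print(rollout_gate(candidate, thresholds))
# → (False, ['hallucination_rate: 0.08 > 0.05'])
```

Run in CI, a gate like this can block a deployment and report which metric regressed.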
Related Skills
This skill integrates with complementary Claude Code skills:
Core Dependencies
- ai-llm-rag-engineering - Advanced RAG patterns, chunking strategies, hybrid retrieval, reranking
- ai-llm-search-retrieval - Search optimization, BM25 tuning, vector search, ranking pipelines
- ai-prompt-engineering - Systematic prompt design, evaluation, testing, and optimization
- ai-agents-development - Agent architectures, tool use, multi-agent systems, autonomous workflows
Production & Operations
- ai-llm-development - Model training, fine-tuning, dataset creation, instruction tuning
- ai-llm-ops-inference - Production serving, quantization, batching, GPU optimization
- ai-ml-ops-production - Deployment patterns, monitoring, drift detection, API design
- ai-ml-ops-security - Security guardrails, prompt injection defense, privacy protection
External Resources
See data/sources.json for 50+ curated authoritative sources:
- Official LLM platform docs - OpenAI, Anthropic, Gemini, Mistral, Azure OpenAI, AWS Bedrock
- Open-source models and frameworks - HuggingFace Transformers, LLaMA, vLLM, PEFT/LoRA, DeepSpeed
- RAG frameworks and vector DBs - LlamaIndex, LangChain, LangGraph, Haystack, Pinecone, Qdrant, Chroma
- 2025 Agentic frameworks - Anthropic Agent SDK, AutoGen, CrewAI, LangGraph Multi-Agent, Semantic Kernel
- 2025 RAG innovations - Microsoft GraphRAG (knowledge graphs), Pathway (real-time), hybrid retrieval
- Prompt engineering - Anthropic Prompt Library, Prompt Engineering Guide, CoT/ReAct patterns
- Evaluation and monitoring - OpenAI Evals, HELM, Anthropic Evals, LangSmith, W&B, Arize Phoenix
- Production deployment - LiteLLM, Ollama, RunPod, Together AI, vLLM serving
Usage
For New Projects
- Start with Production Checklists - Validate all pre-deployment requirements
- Use Decision Matrices - Select technology stack
- Reference Project Planning Patterns - Design FTI pipeline
- Implement with Common Design Patterns - Copy-paste code examples
- Avoid Anti-Patterns - Learn from common mistakes
For Troubleshooting
- Check Anti-Patterns - Identify failure modes and mitigations
- Use Decision Matrices - Evaluate if architecture fits use case
- Reference Common Design Patterns - Verify implementation correctness
For Ongoing Operations
- Follow Production Checklists - Weekly operational tasks
- Integrate Evaluation Patterns - Continuous quality monitoring
- Apply LLMOps Best Practices - Deployment and rollback procedures
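For continuous quality monitoring, one simple drift signal is the population stability index (PSI) between a baseline sample and recent production values, for example response lengths or retrieval scores. The skill does not specify a drift method, so PSI is an assumption here, as are the function name and the binning scheme.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between two numeric samples, using equal-width bins over the baseline range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = list(range(100))             # hypothetical baseline metric values
shifted = [x + 50 for x in baseline]    # hypothetical production values after drift
print(population_stability_index(baseline, baseline))  # identical samples give 0.0
print(population_stability_index(baseline, shifted) > 0.25)
```

A common rule of thumb treats PSI above roughly 0.25 as significant drift, but the alert threshold should be tuned against the system's own history.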
Navigation Summary
- Quick Decisions: Decision Matrices
- Pre-Deployment: Production Checklists
- Planning: Project Planning Patterns
- Implementation: Common Design Patterns
- Troubleshooting: Anti-Patterns
- Domain Depth: LLMOps | Evaluation | Prompts | Agents | RAG
Templates: templates/ - Copy-paste ready production code
Sources: data/sources.json - Authoritative documentation links