Agent skills
cost-aware-llm-pipeline

Agent skill

cost-aware-llm-pipeline

Cost optimization patterns for LLM API usage — model routing by task complexity, budget tracking, retry logic, and prompt caching.

View SKILL.md on GitHub Repository

Stars 132,726

Forks 19,206

Install this agent skill to your Project

npx add-skill https://github.com/affaan-m/everything-claude-code/tree/main/skills/cost-aware-llm-pipeline

SKILL.md

Cost-Aware LLM Pipeline

Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.

When to Activate

Building applications that call LLM APIs (Claude, GPT, etc.)
Processing batches of items with varying complexity
Need to stay within a budget for API spend
Optimizing cost without sacrificing quality on complex tasks

Core Concepts

1. Model Routing by Task Complexity

Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.

python

MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30     # items

def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)

2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Each API call returns a new tracker — never mutates state.

python

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit

3. Narrow Retry Logic

Retry only on transient errors. Fail fast on authentication or bad request errors.

python

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3

def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    # AuthenticationError, BadRequestError etc. → raise immediately

4. Prompt Caching

Cache long system prompts to avoid resending them on every request.

python

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]

Composition

Combine all four techniques in a single pipeline function:

python

def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.create(
        model=model,
        messages=build_cached_messages(system_prompt, text),
    ))

    # 4. Track cost (immutable)
    record = CostRecord(model=model, input_tokens=..., output_tokens=..., cost_usd=...)
    tracker = tracker.add(record)

    return parse_result(response), tracker

Pricing Reference (2025-2026)

Model	Input ($/1M tokens)	Output ($/1M tokens)	Relative Cost
Haiku 4.5	$0.80	$4.00	1x
Sonnet 4.6	$3.00	$15.00	~4x
Opus 4.5	$15.00	$75.00	~19x

Best Practices

Start with the cheapest model and only route to expensive models when complexity thresholds are met
Set explicit budget limits before processing batches — fail early rather than overspend
Log model selection decisions so you can tune thresholds based on real data
Use prompt caching for system prompts over 1024 tokens — saves both cost and latency
Never retry on authentication or validation errors — only transient failures (network, rate limit, server error)

Anti-Patterns to Avoid

Using the most expensive model for all requests regardless of complexity
Retrying on all errors (wastes budget on permanent failures)
Mutating cost tracking state (makes debugging and auditing difficult)
Hardcoding model names throughout the codebase (use constants or config)
Ignoring prompt caching for repetitive system prompts

When to Use

Any application calling Claude, OpenAI, or similar LLM APIs
Batch processing pipelines where cost adds up quickly
Multi-model architectures that need intelligent routing
Production systems that need budget guardrails

Maintainer

affaan-m Core maintainer

Source details

Full Name: affaan-m/everything-claude-code
Branch: main
Path in repo: skills/cost-aware-llm-pipeline
License: MIT License
Topics: claude-code anthropic claude mcp ai-agents developer-tools llm productivity

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

affaan-m/everything-claude-code

python-testing

Python testing best practices using pytest including fixtures, parametrization, mocking, coverage analysis, async testing, and test organization. Use when writing or improving Python tests.

132,726 19,206

Explore

affaan-m/everything-claude-code

golang-patterns

Go-specific design patterns and best practices including functional options, small interfaces, dependency injection, concurrency patterns, error handling, and package organization. Use when working with Go code to apply idiomatic Go patterns.

132,726 19,206

Explore

affaan-m/everything-claude-code

e2e-testing

Playwright E2E testing patterns, Page Object Model, configuration, CI/CD integration, artifact management, and flaky test strategies.

132,726 19,206

Explore

affaan-m/everything-claude-code

agentic-engineering

Operate as an agentic engineer using eval-first execution, decomposition, and cost-aware model routing. Use when AI agents perform most implementation work and humans enforce quality and risk controls.

132,726 19,206

Explore

affaan-m/everything-claude-code

api-design

REST API design patterns including resource naming, status codes, pagination, filtering, error responses, versioning, and rate limiting for production APIs.

132,726 19,206

Explore

affaan-m/everything-claude-code

python-patterns

Python-specific design patterns and best practices including protocols, dataclasses, context managers, decorators, async/await, type hints, and package organization. Use when working with Python code to apply Pythonic patterns.

132,726 19,206

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Cost-Aware LLM Pipeline

When to Activate

Core Concepts

1. Model Routing by Task Complexity

2. Immutable Cost Tracking

3. Narrow Retry Logic

4. Prompt Caching

Composition

Pricing Reference (2025-2026)

Best Practices

Anti-Patterns to Avoid

When to Use

Recommended Agent Skills

python-testing

golang-patterns

e2e-testing

agentic-engineering

api-design

python-patterns