batch-quality

Pre-flight validation and quality gates for batch LLM operations. ACTUALLY tests samples through the LLM before burning tokens. Uses SPARTA contracts for DuckDB validation queries. Integrates with task-monitor for enforced quality gates.


Install this agent skill in your project:

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/batch-quality

SKILL.md

Batch Quality Skill

Prevent wasted LLM calls by validating quality BEFORE running full batch operations.

What This Skill Actually Does

Unlike simple file-existence checks, this skill:

  1. Actually runs the LLM on N samples using scillm
  2. Validates the JSON response structure (excerpts, source_quality, etc.; see the sketch after this list)
  3. Uses SPARTA contracts for DuckDB validation queries
  4. Integrates with task-monitor for enforced quality gates
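
For reference, a passing sample response might look like the sketch below. Only the field names excerpts and source_quality come from this skill's description; the values (and any additional fields) are purely illustrative:

json
{
  "excerpts": ["First extracted passage...", "Second extracted passage..."],
  "source_quality": "high"
}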

Quick Start

bash
cd .pi/skills/batch-quality

# Preflight: Test 3 samples through actual LLM
uv run python cli.py preflight \
    --stage 05 \
    --run-id run-recovery-verify \
    --samples 3

# If preflight passes, run your batch
# ...batch operation...

# Validate: Check DuckDB against contract
uv run python cli.py validate \
    --stage 05 \
    --run-id run-recovery-verify \
    --task-name "sparta-stage-05"

Commands

preflight

Test N samples through the actual LLM before running the full batch.

bash
uv run python cli.py preflight \
    --stage <stage-name> \
    --run-id <sparta-run-id> \
    --samples 3 \
    --prompt <optional-prompt-file>

What it actually does:

  1. Loads the SPARTA contract for the stage (if one exists)
  2. Checks environment variables (CHUTES_API_KEY, CHUTES_TEXT_MODEL)
  3. Connects to DuckDB for the run
  4. Samples N items from the input queue
  5. Runs each sample through scillm (actual LLM call)
  6. Validates JSON response structure
  7. Requires 50%+ of samples to pass

Exit codes:

  • 0: PASSED - safe to proceed
  • 1: FAILED - fix issues first
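
Because preflight signals its result through the exit code, a wrapper script can gate the batch on it directly. A minimal sketch, where the batch command is a placeholder for your own operation:

bash
#!/usr/bin/env bash
set -euo pipefail

# Gate the batch on the preflight exit code (0 = passed, 1 = failed).
if uv run python cli.py preflight --stage 05 --run-id run-recovery-verify --samples 3; then
    echo "Preflight passed; starting batch"
    # ...your batch operation here...
else
    echo "Preflight failed; fix issues before retrying" >&2
    exit 1
fi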

validate

Validate batch output using SPARTA contracts.

bash
uv run python cli.py validate \
    --stage <stage-name> \
    --run-id <sparta-run-id> \
    --task-name <task-monitor-name>

What it actually does:

  1. Loads the SPARTA contract (e.g., 05_extract_knowledge.json)
  2. Runs all validation_queries from the contract against DuckDB
  3. Checks each query result against its expected_min
  4. Notifies task-monitor of pass/fail

Contract example (05_extract_knowledge.json):

json
{
  "validation_queries": [
    {"name": "url_knowledge_count", "query": "SELECT COUNT(*) FROM url_knowledge", "expected_min": 10},
    {"name": "urls_processed", "query": "SELECT COUNT(*) FROM url_extraction_log WHERE ok = true", "expected_min": 5}
  ]
}
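
When a gate fails, you can debug it by running the same contract query by hand with the duckdb CLI. The database path below is an assumption; substitute the DuckDB file for your run:

bash
# Inspect the actual count behind a failing validation query (path is illustrative)
duckdb /path/to/run-recovery-verify.duckdb "SELECT COUNT(*) FROM url_knowledge"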

status

Check current preflight status (JSON output).

bash
uv run python cli.py status
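
Because the output is JSON, it pipes cleanly into jq (assumed installed) for pretty-printing or scripted checks:

bash
uv run python cli.py status | jq .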

clear

Clear stored preflight state (a new preflight will be required before the next batch).

bash
uv run python cli.py clear

SPARTA Pipeline Integration

bash
# 1. Register task with validation requirement
uv run python .pi/skills/task-monitor/monitor.py register \
    --name "sparta-stage-05" \
    --require-validation

# 2. Run preflight (ACTUALLY tests LLM)
uv run python .pi/skills/batch-quality/cli.py preflight \
    --stage 05 \
    --run-id run-recovery-verify \
    --samples 3

# 3. Run batch (only if preflight passed)
uv run python -m sparta.pipeline_duckdb.05_extract_knowledge \
    --run-id run-recovery-verify

# 4. Validate using contract queries
uv run python .pi/skills/batch-quality/cli.py validate \
    --stage 05 \
    --run-id run-recovery-verify \
    --task-name "sparta-stage-05"
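
Note that the "only if preflight passed" comment in step 3 is not enforced by the listing itself. Since each command exits non-zero on failure, chaining steps 2-4 with && (or running them under set -e) makes the gating explicit:

bash
uv run python .pi/skills/batch-quality/cli.py preflight \
    --stage 05 --run-id run-recovery-verify --samples 3 \
&& uv run python -m sparta.pipeline_duckdb.05_extract_knowledge \
    --run-id run-recovery-verify \
&& uv run python .pi/skills/batch-quality/cli.py validate \
    --stage 05 --run-id run-recovery-verify --task-name "sparta-stage-05"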

Configuration

Environment variables:

  • SPARTA_ROOT: Path to SPARTA project (default: /home/graham/workspace/experiments/sparta)
  • CHUTES_API_KEY: API key for LLM calls
  • CHUTES_API_BASE: API base URL (default: https://llm.chutes.ai/v1)
  • CHUTES_TEXT_MODEL: Model ID for text extraction

Contract location: $SPARTA_ROOT/tools/pipeline_gates/fixtures/D3-FEV/contracts/
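
Setting these in the shell before invoking the CLI might look like the following; every value shown is a placeholder:

bash
export SPARTA_ROOT=/path/to/sparta                 # defaults to /home/graham/workspace/experiments/sparta
export CHUTES_API_KEY=your-api-key
export CHUTES_API_BASE=https://llm.chutes.ai/v1    # default
export CHUTES_TEXT_MODEL=your-model-id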

Dependencies

  • typer - CLI framework
  • duckdb - Database queries
  • scillm - LLM batch processing (for actual sample testing)

Key Principle

Preflight is cheap. Failed batches are expensive.

Testing 3 samples costs ~$0.01 and takes about 30 seconds; running 1,000 items with a broken prompt costs ~$3 and takes hours. Both figures assume roughly $0.003 per LLM call, and the exact rate matters less than the ratio: preflight costs about 0.3% of the full batch.
