Token-Efficient Delegation

⛔ CRITICAL: No Claude Subagents

HARD STOP: Never use model="haiku", model="sonnet", or model="opus" in Task tool calls.

Always use: model parameter omitted (routes to MiniMax/GLM)
Or explicitly: general-purpose agent (uses cheaper default)
Enforced in: .claude/settings.local.json deny rules

Core Rule: Claude = reasoning/decisions, MiniMax/GLM = research/exploration

The Three Goals (Balance All Three)

Goal	Priority	Strategy
Quality	Keep Opus-level output	Claude synthesizes, subagents gather
Cost	Minimize Claude tokens	Delegate exploration, batch analysis
Speed	Fast execution	Parallel subagents, right model for task

Quick Decision Tree

Before ANY task, ask:

Research/exploration? → Delegate to MiniMax (fast) or GLM (creative)
Web search needed? → Delegate to MiniMax
Reading specific known file? → Claude direct (Read tool)
Analyzing 1-3 images? → Claude direct
Analyzing 10+ images? → Parallel subagents (GLM for quality, MiniMax for speed)
Complex reasoning/code? → Claude direct
Broad codebase search? → Delegate to general-purpose agent (cheaper default)

Model Characteristics (Official Guidance)

Model	Cost	Speed	Best For	Source
Claude Opus	50x	Fast	Complex reasoning, architecture, final decisions	Anthropic Docs
Claude Sonnet	10x	Fast	Coding, medium reasoning, code review	Anthropic Docs
Claude Haiku	2x	Fastest	Exploration, classification, simple tasks	Anthropic Docs
MiniMax M2.1	1x	Fast	Instruction-following, agents, multi-lang coding	MiniMax Docs
GLM-4.7	1x	Medium	Creative problem-solving, math, tool use	Z.AI Docs
GLM-4V	1x	Medium	Vision, 66K multimodal context, GUI tasks	Z.AI Docs

When to Use Each Model

MiniMax M2.1 - Fast Instruction-Follower

Source: MiniMax Official Docs

Quick web searches
Structured data extraction
File reading with specific questions
Tasks with clear instructions
8% of Claude cost, 2x speed
Preserve reasoning chains (return full response with thinking field)

GLM-4.7 - Creative Problem-Solver

Source: Z.AI GLM-4.7 Docs

Complex problem exploration
Mathematical reasoning (95.7% on AIME 2025)
Creative solutions needed
Tool use orchestration (42.8%, ties GPT-5.1)
Parallel batch tasks (latency doesn't matter)
Use thinking mode for complex tasks, disable for simple queries

GLM-4V - Vision Tasks

Source: GLM-V GitHub

66K-token multimodal context
Screenshot/GUI analysis
Document parsing (PDFs, charts)
Video understanding with timestamps
Multi-image analysis

Claude (Haiku/Sonnet/Opus) - Reasoning & Decisions

Source: Anthropic Subagents Docs

Haiku: Fast exploration, read-only tasks, classification
Sonnet: Coding, medium-complexity reasoning
Opus: Architecture decisions, synthesis, complex reasoning
Use subagents to isolate verbose output from main context

Quality Preservation Patterns

Pattern 1: Orchestrator-Worker Split

Claude (Orchestrator) → Spawns subagents → Subagents research → Claude synthesizes

Quality preserved because: Claude makes all decisions based on subagent research.

Pattern 2: Parallel Research + Single Synthesis

5 MiniMax agents research in parallel → Claude receives summaries → Claude decides

Quality preserved because: Multiple perspectives, Claude does final reasoning.

Pattern 3: Two-Stage Review

Subagent implements → Reviewer subagent checks → Claude integrates

Quality preserved because: Cross-validation catches errors before Claude acts.

Pattern 4: Specific Prompts

Source: MiniMax Best Practices

"State the 'why' behind your request - when models understand purpose, they provide more accurate answers."

BAD:  "Research authentication"
GOOD: "Find how Godot 4.5 handles input authentication. I need to implement player login. Return: method name, file location, example usage."

Speed Optimization Patterns

Pattern 1: Parallel Execution

Launch independent tasks simultaneously:

Task(prompt="Research X")  ←─┐
Task(prompt="Research Y")  ←─┼─ Single message, parallel execution
Task(prompt="Research Z")  ←─┘

Pattern 2: Right Model for Latency

Need	Model	Latency
Instant response	Haiku	~100ms
Quick research	MiniMax	~200ms
Deep analysis	GLM-4.7	~500ms
Complex reasoning	Opus	~300ms

Pattern 3: Batch Over Sequential

BAD:  Analyze image 1 → wait → Analyze image 2 → wait → ...
GOOD: Spawn 10 parallel agents → each analyzes 1 image → aggregate

Cost Optimization Patterns

Pattern 1: Delegate Exploration

Source: Anthropic Advanced Tool Use

"Delegate verbose operations (tests, log processing, documentation fetching) to subagents so output stays isolated from main conversation."

Pattern 2: Context Isolation

Subagents prevent context bloat:

Main context: 5000 tokens (stays lean)
Subagent reads 10 files: 15000 tokens (isolated, discarded after)
Result passed back: 500 tokens (only what matters)

Pattern 3: Cascade Routing

Source: Anthropic Model Selection

60-70% of queries → Haiku (cheapest)
20-30% of queries → Sonnet (medium)
10-15% of queries → Opus (only when needed)

Pattern 4: Token Suspension via Background Execution

The Problem: Claude tokens burn while waiting for subagent results.

The Solution: Use run_in_background=true and end the turn early.

┌─────────────────────────────────────────────────────────┐
│ BLOCKING (expensive)                                    │
├─────────────────────────────────────────────────────────┤
│ Claude spawns subagent → waits 60s → gets result       │
│ Cost: 60 seconds of Opus tokens BURNED                  │
├─────────────────────────────────────────────────────────┤
│ BACKGROUND + END TURN (cheap)                           │
├─────────────────────────────────────────────────────────┤
│ Claude spawns with run_in_background=true → ends turn  │
│ Cost: ~0 seconds of Opus tokens (turn ended)           │
│ Subagent runs: 60s of cheap tokens                      │
│ User says "continue" → Claude retrieves with TaskOutput│
└─────────────────────────────────────────────────────────┘

When to suspend (use background):

Research/exploration tasks >30 seconds
Batch image analysis (10+ images)
Web searches with multiple queries
Code review by subagent

When NOT to suspend:

Quick lookups (<5 seconds)
Claude needs result to continue current reasoning
Interactive debugging requiring rapid iteration

Fire-and-Retrieve Workflow:

1. Task(prompt="Research X", run_in_background=true)
   → Returns immediately with task_id and output_file
2. Claude responds: "Research dispatched. Say 'continue' when ready."
3. User prompts again
4. TaskOutput(task_id="...", block=true)
   → Returns full results
5. Claude synthesizes and delivers answer

Key phrase: For tasks >30 seconds, use background execution and end turn early.

Anti-Patterns to Avoid

Anti-Pattern	Problem	Fix
Claude reads 10+ files	Context bloat, expensive	Spawn Explore agent
Sequential subagent calls	Slow, wasted time	Parallel in single message
Re-reading subagent results	Duplicates tokens	Trust the summary, decide
Opus for simple classification	Overkill, expensive	Use Haiku
GLM for quick lookups	Too slow	Use MiniMax

Integration with Project Workflows

HPV (Playtesting)

Use MCP for game state inspection (local, fast)
Delegate screenshot analysis to GLM-4V for batch checks

Visual Development

1-3 sprites: Claude direct
10+ sprites: Parallel GLM-4V agents
Quality gate: Claude reviews subagent findings

Code Implementation

Research patterns: MiniMax subagents
Write code: Claude direct
Review: Haiku subagent for spec compliance

Official Documentation Sources

[Claude Opus 4.5 - 2026-01-29]

Search AI Tools

token-efficient-delegation

Install this agent skill to your Project

SKILL.md