Agent skill
qa
Black-box QA audit of slop-guard across MCP, CLI, fit, docs, agent workflows, and writing-effectiveness. Files GitHub issues for real problems found.
Install this agent skill to your Project
npx add-skill https://github.com/eric-tramel/slop-guard/tree/main/.claude/skills/qa
SKILL.md
Slop-Guard QA Audit
Black-box QA of slop-guard through its public interfaces only. Do not read source files or tests to find bugs — discover issues by using the product as an agent or user would.
Launch parallel subagents per testing angle. File genuine issues with gh issue create. If $ARGUMENTS specifies a focus area (mcp, cli, fit, docs, workflows, or effectiveness), run only that angle. Otherwise run all 6 in parallel.
Setup
- Read
README.mdfor documented behavior. - If
mcp__slop_guard__*tools are available, fetch their schemas via tool search. - Fetch open issues to avoid duplicates:
bash
gh issue list --repo eric-tramel/slop-guard --limit 100 --state open --json number,title,body,labels - Prefer local QA fixtures if present. Otherwise create temp files under
/tmp/slop-guard-qa. Do not write throwaway files into the repo.
Pass the open issues list into every subagent prompt.
Testing Angles
1. MCP Agent Interface (mcp)
Is the MCP surface clear, consistent, and useful for an agent?
- Tool descriptions, parameter clarity, distinction between
check_slopandcheck_slop_file - JSON output structure:
score,band,violations,counts,advice,file - Edge cases: short text, empty strings, unicode, markdown, code fences, long inputs
- Missing/unreadable file paths
- Response size and parseability
- Cross-check MCP vs CLI JSON output on the same input
Issue prefix: mcp:
2. CLI UX (cli)
Does uv run sg behave predictably?
- All flags from
sg --help:--json,--verbose,--quiet,--threshold,--score-only,--counts,--config,--version - Input modes: file paths, inline text, stdin (
-), multiple inputs - Exit codes: 0 (pass), 1 (threshold fail), 2 (error)
- Stdout/stderr separation for scripting
- Path-vs-inline-text ambiguity
- Missing files, bad config paths
Issue prefix: cli:
3. Fit Workflow (fit)
Can a user fit a custom rule config from docs alone?
- Legacy shorthand:
sg-fit TARGET_CORPUS OUTPUT - Multi-input mode with
--output - All input formats:
.jsonl,.txt,.md --negative-dataset,--no-calibration,--init- Error cases: malformed JSONL, missing
textfield, invalidlabel, missing files, unsupported suffixes - Round-trip: is the fitted JSONL immediately usable with
sg -c?
Issue prefix: fit:
4. Docs and Onboarding (docs)
Do install flows and docs match actual behavior?
- Install paths:
uvx slop-guard,uv tool install slop-guard,uv run sg - MCP setup snippets for Claude and Codex
- Custom config examples
- Whether documented commands, flags, and tool names match reality
Issue prefix: docs:
5. End-to-End Workflows (workflows)
Does slop-guard compose well for real agent work?
- Lint prose via MCP or CLI, assess whether advice is actionable enough to drive a rewrite
- Lint a real file via
check_slop_fileorsg README.md - CI gating with
sg -t 60 ...— is the output machine-parseable? - Compare MCP and CLI results on the same input
Focus on cross-surface consistency, scoring coherence, and whether the product actually improves agent writing workflows.
Issue prefix: workflow:
6. Writing Effectiveness (effectiveness)
Does slop-guard's feedback actually make an agent produce better writing? This angle tests the quality of the feedback loop — not whether the tool runs correctly, but whether following its guidance leads to measurably improved prose.
Method: Act as an agent that writes, scores, interprets advice, rewrites, and rescores. Evaluate every step for friction, ambiguity, and actual improvement.
6a. Advice Actionability
For each test, write a paragraph in a specific register (technical docs, blog post, marketing copy, academic summary, casual explanation), score it, then attempt to follow every piece of advice literally.
- Is each advice string specific enough to know what to change? ("Replace 'crucial' — what specifically do you mean?" is good; a vague suggestion would be bad)
- Does the advice tell you how to fix it, or only that something is wrong?
- When multiple violations overlap in the same sentence, does the combined advice make sense or contradict itself?
- Are there violations with no corresponding advice entry? (The advice array should cover every fixable issue)
- Does any advice lead you toward a different slop pattern? (e.g., replacing a slop word with another slop word)
6b. Feedback Loop Convergence
Run iterative rewrite cycles and track whether the score converges upward:
- Write 3-5 paragraphs of deliberately sloppy AI-style prose (heavy on slop words, uniform rhythm, bold-bullet structures, contrast pairs)
- Score via
check_sloporsg --json - Rewrite following only the advice array — no independent judgment
- Rescore the rewrite
- Repeat until score stabilizes or advice is empty
Evaluate:
- Does the score improve monotonically with each cycle? If not, what caused a regression?
- How many cycles to reach
cleanband? (Should be 1-2 for light, 2-3 for moderate) - Does the advice array shrink each cycle, or do new violations appear as old ones are fixed?
- Is there a score plateau where advice remains but following it doesn't move the score? What's blocking?
- At convergence, does the text actually read well — or has it been "optimized" into something sterile/awkward?
6c. Score Sensitivity and Proportionality
Test whether scores reflect actual writing quality differences:
- Write two versions of the same content: one natural, one AI-sloppy. Do scores separate cleanly?
- Take clean prose (human-written, score >80) and introduce one slop word. How much does the score drop? Is the penalty proportional?
- Take a short text (<50 words) vs a long text (~500 words) with the same density of violations. Are scores comparable?
- Does the concentration penalty feel right? Write text with 1 contrast pair (should be mild) vs 5 contrast pairs (should be harsh). Check that scoring reflects this.
6d. Advice Interpretation by an Agent
Simulate how an LLM agent would parse and act on the JSON output:
- Given the violations array with
matchandcontextfields, can you locate the exact position in the original text to edit? Is the 60-char context window sufficient? - When advice says "Replace X — what specifically do you mean?", does the agent have enough context from the surrounding violations to pick a good replacement?
- Are the
countsuseful for triage? (e.g., "12 slop_words vs 1 rhythm issue" — should the agent focus on vocabulary first?) - If the agent can only make one edit, does the output help prioritize which fix has the highest impact? (Penalties vary: -10 for ai_disclosure vs -1 for a slop word)
- Does the band label (
light,moderate, etc.) help the agent decide whether to rewrite or accept?
6e. Register Sensitivity
Test whether slop-guard handles different writing registers fairly:
- Technical documentation: Do legitimate technical terms get false-positived as slop? (e.g., "framework", "robust", "scalable" in a systems design context)
- Persuasive/marketing copy: This register naturally uses more emphatic language. Does slop-guard penalize it unfairly, or does it correctly distinguish marketing voice from AI slop?
- Academic writing: Hedging language ("arguably", "perhaps", "it seems") is standard in academic prose. Does the weasel rule over-penalize legitimate scholarly register?
- Conversational/casual: Does informal writing score better simply because it avoids formal slop patterns, even if it's low-quality?
For each register, write a good example and a sloppy example. The tool should score good writing higher regardless of register.
6f. Rewrite Quality Assessment
After the agent completes a rewrite cycle, evaluate the output text:
- Does the rewritten text preserve the original meaning and intent?
- Has the rewrite introduced awkward phrasing or unnatural constructions to dodge rules?
- Is the rewritten text something a human editor would accept, or does it feel "linted" — technically clean but lifeless?
- Compare: original sloppy text vs rewritten text vs hand-written alternative. Where does the tool-guided rewrite land on that spectrum?
File issues for patterns where following advice consistently produces worse prose, where scores don't reflect quality, where the feedback loop stalls, or where the interface makes it hard for an agent to act on feedback.
Issue prefix: effectiveness:
Issue Filing
Before filing, check every open issue — skip if same root cause, same behavior, or strict subset. When in doubt, don't file. If related but distinct, add Related: #N.
Use existing repo labels (bug, enhancement, documentation, etc.). Each issue body must include:
- Summary (user/agent perspective)
- Reproduction steps (exact commands or tool calls)
- Expected vs observed behavior
- Severity:
critical/high/medium/low - Surface:
mcp/cli/fit/docs/workflow/effectiveness Generated with [Claude Code](https://claude.com/claude-code)
For effectiveness issues: use the enhancement label unless the issue describes advice that actively misleads (then bug). Include the original text, the advice received, the rewrite attempt, and before/after scores. Concrete examples are mandatory — do not file vague "advice could be better" issues.
Summary Report
After all angles complete, compile results grouped by severity. Include: total issues filed, findings already tracked, what works well, what was not tested, and whether MCP tools were available.
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
obsidian-clipper-template-creator
Guide for creating templates for the Obsidian Web Clipper. Use when you want to create a new clipping template, understand available variables, or format clipped content.
claude-code-expert
Especialista profundo em Claude Code - CLI da Anthropic. Maximiza produtividade com atalhos, hooks, MCPs, configuracoes avancadas, workflows, CLAUDE.md, memoria, sub-agentes, permissoes e integracao com ecossistemas.
lex
Centralized 'Truth Engine' for cross-jurisdictional legal context (US, EU, CA) and contract scaffolding.
odoo-inventory-optimizer
Expert guide for Odoo Inventory: stock valuation (FIFO/AVCO), reordering rules, putaway strategies, routes, and multi-warehouse configuration.
android_ui_verification
Automated end-to-end UI testing and verification on an Android Emulator using ADB.
seo-cannibalization-detector
Analyzes multiple provided pages to identify keyword overlap and potential cannibalization issues. Suggests differentiation strategies. Use PROACTIVELY when reviewing similar content.
Didn't find tool you were looking for?