Agent skill

qa

Black-box QA audit of slop-guard across MCP, CLI, fit, docs, agent workflows, and writing-effectiveness. Files GitHub issues for real problems found.

View SKILL.md on GitHub Repository

Stars 113

Forks 8

Install this agent skill to your Project

npx add-skill https://github.com/eric-tramel/slop-guard/tree/main/.claude/skills/qa

SKILL.md

Slop-Guard QA Audit

Black-box QA of slop-guard through its public interfaces only. Do not read source files or tests to find bugs — discover issues by using the product as an agent or user would.

Launch parallel subagents per testing angle. File genuine issues with gh issue create. If $ARGUMENTS specifies a focus area (mcp, cli, fit, docs, workflows, or effectiveness), run only that angle. Otherwise run all 6 in parallel.

Setup

Read README.md for documented behavior.
If mcp__slop_guard__* tools are available, fetch their schemas via tool search.

Fetch open issues to avoid duplicates:

bash

gh issue list --repo eric-tramel/slop-guard --limit 100 --state open --json number,title,body,labels

Prefer local QA fixtures if present. Otherwise create temp files under /tmp/slop-guard-qa. Do not write throwaway files into the repo.

Pass the open issues list into every subagent prompt.

Testing Angles

1. MCP Agent Interface (`mcp`)

Is the MCP surface clear, consistent, and useful for an agent?

Tool descriptions, parameter clarity, distinction between check_slop and check_slop_file
JSON output structure: score, band, violations, counts, advice, file
Edge cases: short text, empty strings, unicode, markdown, code fences, long inputs
Missing/unreadable file paths
Response size and parseability
Cross-check MCP vs CLI JSON output on the same input

Issue prefix: mcp:

2. CLI UX (`cli`)

Does uv run sg behave predictably?

All flags from sg --help: --json, --verbose, --quiet, --threshold, --score-only, --counts, --config, --version
Input modes: file paths, inline text, stdin (-), multiple inputs
Exit codes: 0 (pass), 1 (threshold fail), 2 (error)
Stdout/stderr separation for scripting
Path-vs-inline-text ambiguity
Missing files, bad config paths

Issue prefix: cli:

3. Fit Workflow (`fit`)

Can a user fit a custom rule config from docs alone?

Legacy shorthand: sg-fit TARGET_CORPUS OUTPUT
Multi-input mode with --output
All input formats: .jsonl, .txt, .md
--negative-dataset, --no-calibration, --init
Error cases: malformed JSONL, missing text field, invalid label, missing files, unsupported suffixes
Round-trip: is the fitted JSONL immediately usable with sg -c?

Issue prefix: fit:

4. Docs and Onboarding (`docs`)

Do install flows and docs match actual behavior?

Install paths: uvx slop-guard, uv tool install slop-guard, uv run sg
MCP setup snippets for Claude and Codex
Custom config examples
Whether documented commands, flags, and tool names match reality

Issue prefix: docs:

5. End-to-End Workflows (`workflows`)

Does slop-guard compose well for real agent work?

Lint prose via MCP or CLI, assess whether advice is actionable enough to drive a rewrite
Lint a real file via check_slop_file or sg README.md
CI gating with sg -t 60 ... — is the output machine-parseable?
Compare MCP and CLI results on the same input

Focus on cross-surface consistency, scoring coherence, and whether the product actually improves agent writing workflows.

Issue prefix: workflow:

6. Writing Effectiveness (`effectiveness`)

Does slop-guard's feedback actually make an agent produce better writing? This angle tests the quality of the feedback loop — not whether the tool runs correctly, but whether following its guidance leads to measurably improved prose.

Method: Act as an agent that writes, scores, interprets advice, rewrites, and rescores. Evaluate every step for friction, ambiguity, and actual improvement.

6a. Advice Actionability

For each test, write a paragraph in a specific register (technical docs, blog post, marketing copy, academic summary, casual explanation), score it, then attempt to follow every piece of advice literally.

Is each advice string specific enough to know what to change? ("Replace 'crucial' — what specifically do you mean?" is good; a vague suggestion would be bad)
Does the advice tell you how to fix it, or only that something is wrong?
When multiple violations overlap in the same sentence, does the combined advice make sense or contradict itself?
Are there violations with no corresponding advice entry? (The advice array should cover every fixable issue)
Does any advice lead you toward a different slop pattern? (e.g., replacing a slop word with another slop word)

6b. Feedback Loop Convergence

Run iterative rewrite cycles and track whether the score converges upward:

Write 3-5 paragraphs of deliberately sloppy AI-style prose (heavy on slop words, uniform rhythm, bold-bullet structures, contrast pairs)
Score via check_slop or sg --json
Rewrite following only the advice array — no independent judgment
Rescore the rewrite
Repeat until score stabilizes or advice is empty

Evaluate:

Does the score improve monotonically with each cycle? If not, what caused a regression?
How many cycles to reach clean band? (Should be 1-2 for light, 2-3 for moderate)
Does the advice array shrink each cycle, or do new violations appear as old ones are fixed?
Is there a score plateau where advice remains but following it doesn't move the score? What's blocking?
At convergence, does the text actually read well — or has it been "optimized" into something sterile/awkward?

6c. Score Sensitivity and Proportionality

Test whether scores reflect actual writing quality differences:

Write two versions of the same content: one natural, one AI-sloppy. Do scores separate cleanly?
Take clean prose (human-written, score >80) and introduce one slop word. How much does the score drop? Is the penalty proportional?
Take a short text (<50 words) vs a long text (~500 words) with the same density of violations. Are scores comparable?
Does the concentration penalty feel right? Write text with 1 contrast pair (should be mild) vs 5 contrast pairs (should be harsh). Check that scoring reflects this.

6d. Advice Interpretation by an Agent

Simulate how an LLM agent would parse and act on the JSON output:

Given the violations array with match and context fields, can you locate the exact position in the original text to edit? Is the 60-char context window sufficient?
When advice says "Replace X — what specifically do you mean?", does the agent have enough context from the surrounding violations to pick a good replacement?
Are the counts useful for triage? (e.g., "12 slop_words vs 1 rhythm issue" — should the agent focus on vocabulary first?)
If the agent can only make one edit, does the output help prioritize which fix has the highest impact? (Penalties vary: -10 for ai_disclosure vs -1 for a slop word)
Does the band label (light, moderate, etc.) help the agent decide whether to rewrite or accept?

6e. Register Sensitivity

Test whether slop-guard handles different writing registers fairly:

Technical documentation: Do legitimate technical terms get false-positived as slop? (e.g., "framework", "robust", "scalable" in a systems design context)
Persuasive/marketing copy: This register naturally uses more emphatic language. Does slop-guard penalize it unfairly, or does it correctly distinguish marketing voice from AI slop?
Academic writing: Hedging language ("arguably", "perhaps", "it seems") is standard in academic prose. Does the weasel rule over-penalize legitimate scholarly register?
Conversational/casual: Does informal writing score better simply because it avoids formal slop patterns, even if it's low-quality?

For each register, write a good example and a sloppy example. The tool should score good writing higher regardless of register.

6f. Rewrite Quality Assessment

After the agent completes a rewrite cycle, evaluate the output text:

Does the rewritten text preserve the original meaning and intent?
Has the rewrite introduced awkward phrasing or unnatural constructions to dodge rules?
Is the rewritten text something a human editor would accept, or does it feel "linted" — technically clean but lifeless?
Compare: original sloppy text vs rewritten text vs hand-written alternative. Where does the tool-guided rewrite land on that spectrum?

File issues for patterns where following advice consistently produces worse prose, where scores don't reflect quality, where the feedback loop stalls, or where the interface makes it hard for an agent to act on feedback.

Issue prefix: effectiveness:

Issue Filing

Before filing, check every open issue — skip if same root cause, same behavior, or strict subset. When in doubt, don't file. If related but distinct, add Related: #N.

Use existing repo labels (bug, enhancement, documentation, etc.). Each issue body must include:

Summary (user/agent perspective)
Reproduction steps (exact commands or tool calls)
Expected vs observed behavior
Severity: critical / high / medium / low
Surface: mcp / cli / fit / docs / workflow / effectiveness
Generated with [Claude Code](https://claude.com/claude-code)

For effectiveness issues: use the enhancement label unless the issue describes advice that actively misleads (then bug). Include the original text, the advice received, the rewrite attempt, and before/after scores. Concrete examples are mandatory — do not file vague "advice could be better" issues.

Summary Report

After all angles complete, compile results grouped by severity. Include: total issues filed, findings already tracked, what works well, what was not tested, and whether MCP tools were available.

Maintainer

eric-tramel Core maintainer

Source details

Full Name: eric-tramel/slop-guard
Branch: main
Path in repo: .claude/skills/qa
License: MIT License
Topics: mcp llm writing-tool writing

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

sickn33/antigravity-awesome-skills

obsidian-clipper-template-creator

Guide for creating templates for the Obsidian Web Clipper. Use when you want to create a new clipping template, understand available variables, or format clipped content.

28,421 4,766

Explore

sickn33/antigravity-awesome-skills

claude-code-expert

Especialista profundo em Claude Code - CLI da Anthropic. Maximiza produtividade com atalhos, hooks, MCPs, configuracoes avancadas, workflows, CLAUDE.md, memoria, sub-agentes, permissoes e integracao com ecossistemas.

28,421 4,766

Explore

sickn33/antigravity-awesome-skills

lex

Centralized 'Truth Engine' for cross-jurisdictional legal context (US, EU, CA) and contract scaffolding.

28,421 4,766

Explore

sickn33/antigravity-awesome-skills

odoo-inventory-optimizer

Expert guide for Odoo Inventory: stock valuation (FIFO/AVCO), reordering rules, putaway strategies, routes, and multi-warehouse configuration.

28,421 4,766

Explore

sickn33/antigravity-awesome-skills

android_ui_verification

Automated end-to-end UI testing and verification on an Android Emulator using ADB.

28,421 4,766

Explore

sickn33/antigravity-awesome-skills

seo-cannibalization-detector

Analyzes multiple provided pages to identify keyword overlap and potential cannibalization issues. Suggests differentiation strategies. Use PROACTIVELY when reviewing similar content.

28,421 4,766

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Slop-Guard QA Audit

Setup

Testing Angles

1. MCP Agent Interface (mcp)

2. CLI UX (cli)

3. Fit Workflow (fit)

4. Docs and Onboarding (docs)

5. End-to-End Workflows (workflows)

6. Writing Effectiveness (effectiveness)

6a. Advice Actionability

6b. Feedback Loop Convergence

6c. Score Sensitivity and Proportionality

6d. Advice Interpretation by an Agent

6e. Register Sensitivity

6f. Rewrite Quality Assessment

Issue Filing

Summary Report

Recommended Agent Skills

obsidian-clipper-template-creator

claude-code-expert

lex

odoo-inventory-optimizer

android_ui_verification

seo-cannibalization-detector

1. MCP Agent Interface (`mcp`)

2. CLI UX (`cli`)

3. Fit Workflow (`fit`)

4. Docs and Onboarding (`docs`)

5. End-to-End Workflows (`workflows`)

6. Writing Effectiveness (`effectiveness`)