evaluation-anchor-checker

Audit and rewrite evaluation/numeric claims to ensure they carry minimal protocol context (task + metric + constraint) and avoid underspecified model naming.

  • **Trigger**: evaluation anchor checker, numeric claim hygiene, underspecified numbers, protocol context, 评测锚点检查, 数字断言, 指标上下文.
  • **Use when**: before final merge/polish, or when reviewers would likely flag claims as underspecified (numbers without task/metric/budget), or `pipeline-auditor` warns about suspicious model naming.
  • **Skip if**: evidence is too thin to justify numeric claims (route upstream to C3/C4), or you are pre-C2 (NO PROSE).
  • **Network**: none.
  • **Guardrail**: do not invent numbers; do not add/remove/move citation keys; if protocol context is missing, weaken/remove the numeric claim rather than guessing.

Install this agent skill to your Project

npx add-skill https://github.com/WILLOSCAR/research-units-pipeline-skills/tree/main/.codex/skills/evaluation-anchor-checker

SKILL.md

Evaluation Anchor Checker (make numbers reviewer-safe)

Purpose: fix a reviewer-magnet failure mode in agent surveys:

  • strong numeric/performance statements appear
  • but the minimal evaluation context is missing

This skill treats numeric claims as contracts:

  • if a number stays, the same sentence must contain enough protocol context to interpret it
  • if that context is not in evidence, the claim must be downgraded (no guessing)

Inputs

Preferred (pre-merge, keeps anchoring intact):

  • the affected sections/*.md files

Optional context (read-only; helps you avoid guessing):

  • outline/writer_context_packs.jsonl (look for evaluation_anchor_minimal, evaluation_protocol, anchor_facts)
  • outline/evidence_drafts.jsonl / outline/anchor_sheet.jsonl
  • citations/ref.bib
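
A minimal sketch of how the optional context could be scanned for anchor fields. The three field names come from the list above; the record layout (one JSON object per line) and the `sub_id` key are assumptions for illustration, not the pipeline's confirmed schema.

```python
import json

# Field names taken from the skill text; everything else about the record
# layout is an assumption.
ANCHOR_FIELDS = ("evaluation_anchor_minimal", "evaluation_protocol", "anchor_facts")


def load_anchor_context(jsonl_text: str) -> list[dict]:
    """Return, per JSONL record, only the anchor fields that are present and non-empty."""
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        pack = json.loads(line)
        found = {key: pack[key] for key in ANCHOR_FIELDS if pack.get(key)}
        # "sub_id" is a hypothetical identifier for the subsection.
        records.append({"sub_id": pack.get("sub_id"), **found})
    return records
```

In practice the text would come from something like `Path("outline/writer_context_packs.jsonl").read_text(encoding="utf-8")`. A record that yields no anchor fields is a signal that numbers in that subsection may need downgrading.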

Outputs

  • Updated sections/*.md (or output/DRAFT.md if you are post-merge), with safer evaluation anchoring
  • output/EVAL_ANCHOR_REPORT.md (always; short report with files checked / changed / weakened sentences)
  • Optional completion marker: output/eval_anchors_checked.refined.ok
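
One way the report could be produced is sketched below. The skill only states that the report covers files checked / changed / weakened sentences; the headings and bullet layout here are assumptions, and `render_report` / `write_report` are hypothetical helper names.

```python
from pathlib import Path


def _bullets(items: list[str]) -> list[str]:
    """Format items as markdown bullets, with a placeholder when empty."""
    return [f"- {item}" for item in items] or ["- (none)"]


def render_report(checked: list[str], changed: list[str], weakened: list[str]) -> str:
    """Build the report text (layout is an assumption, not the real format)."""
    lines = [
        "# Evaluation Anchor Report",
        "",
        f"- Files checked: {len(checked)}",
        f"- Files changed: {len(changed)}",
        f"- Sentences weakened: {len(weakened)}",
        "",
        "## Changed files",
        *_bullets(changed),
        "",
        "## Weakened sentences",
        *_bullets(weakened),
    ]
    return "\n".join(lines) + "\n"


def write_report(workspace: Path, checked: list[str], changed: list[str],
                 weakened: list[str]) -> Path:
    """Write output/EVAL_ANCHOR_REPORT.md under the given workspace."""
    out = workspace / "output" / "EVAL_ANCHOR_REPORT.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(render_report(checked, changed, weakened), encoding="utf-8")
    return out
```

The report is written even when nothing changed, which matches the "always" requirement above.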

Recommended slot in the survey pipeline

Use this as the last section-level numeric hygiene sweep before merge:

  • after paragraph-curator + style-harmonizer + opener-variator
  • before transition-weaver / section-merger

Reason:

  • earlier section-level rewrite passes can legitimately rephrase or fuse numeric sentences
  • if you rely on pipeline-auditor alone, numeric-context issues surface only in the merged draft, which is too late
  • section-scoped fixes are cheaper and preserve citation anchoring better than post-merge patching

Read Order

Always read:

  • references/numeric_hygiene.md

Machine-readable asset:

  • assets/numeric_hygiene.json

The asset defines the keyword families and qualitative fallback templates. Keep the script deterministic and let the policy live in the asset/reference pair.

Role prompt: Reviewer-minded Editor (evaluation hygiene)

You are a reviewer-minded editor for evaluation claims in a technical survey.

Goal:
- make every numeric/performance claim interpretable and reviewer-safe

Hard constraints:
- do not invent numbers
- do not add/remove/move citation keys
- if protocol context is missing, weaken or remove the numeric claim

Minimum context to include when keeping a number:
- task / setting (what kind of task)
- metric (what is being measured)
- constraint (budget/cost/tool access/horizon/seed/logging) when relevant

Avoid:
- ambiguous model naming that looks hallucinated (e.g., “GPT-5”) unless the cited paper uses it verbatim

Workflow (explicit inputs)

  • Use outline/writer_context_packs.jsonl to locate the subsection's allowed citations and any extracted evaluation_protocol/anchor_facts.
  • Cross-check outline/evidence_drafts.jsonl and outline/anchor_sheet.jsonl for task/metric/constraint context before touching numbers.
  • Validate every cited key against citations/ref.bib (do not introduce new keys).
  • Write output/EVAL_ANCHOR_REPORT.md so the pipeline has an auditable completion artifact for this sweep.
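
The key-validation step can be approximated with two regular expressions, as sketched here. It assumes Pandoc-style citations (`[@Key]`, as in the mini examples further down) and conventional BibTeX entry headers; `unknown_keys` is a hypothetical helper, not part of the shipped script.

```python
import re

# Pandoc-style citation keys such as [@SomeBench] or [@A; @B] (assumed syntax).
CITE_KEY = re.compile(r"@([A-Za-z0-9_][A-Za-z0-9_:-]*)")
# BibTeX entry headers such as "@article{SomeBench,".
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(markdown: str) -> set[str]:
    """Collect every citation key used in a section."""
    return set(CITE_KEY.findall(markdown))


def bib_keys(bibtex: str) -> set[str]:
    """Collect every entry key defined in the .bib file."""
    return set(BIB_ENTRY.findall(bibtex))


def unknown_keys(markdown: str, bibtex: str) -> list[str]:
    """Keys cited in the section that have no entry in the bibliography."""
    return sorted(cited_keys(markdown) - bib_keys(bibtex))
```

An empty result means every cited key resolves against `citations/ref.bib`; anything else should be reported rather than patched, since the guardrail forbids adding or removing keys.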

What to enforce (the “minimum protocol trio”)

When a sentence contains a numeric claim (a percentage, a multiplier such as 2x, or any other number):

  • Keep the number only if you can attach at least 2 of the following in the same sentence without guessing:
    • task family / benchmark name
    • metric definition
    • constraint (budget, tool access, cost model, retries, horizon)

If you cannot, downgrade:

  • remove the number and rewrite as qualitative (“often”, “can”, “may”) with the same citation
  • or move the specificity into a verification target (“evaluations need to report …”) without adding new facts
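
The keep-or-downgrade rule above can be sketched as a keyword heuristic. The keyword families below are hypothetical stand-ins (the real ones are said to live in assets/numeric_hygiene.json, whose contents are not shown here), and `check_sentence` is an illustrative name. A real pass would still need an editor to confirm that the matched context is supported by evidence.

```python
import re

# Hypothetical keyword families; the actual policy lives in the asset file.
FAMILIES = {
    "task": ("benchmark", "task", "dataset", "suite"),
    "metric": ("accuracy", "success rate", "pass@", "f1", "exact match"),
    "constraint": ("budget", "tool access", "retries", "horizon", "cost", "steps"),
}

# Citation groups are stripped first so digits inside keys are not counted.
CITATION = re.compile(r"\[@[^\]]*\]")
# A number not glued to a preceding letter, "@", or digit (skips "F1", "pass@1").
NUMBER = re.compile(r"(?<![A-Za-z@\d])\d+(?:\.\d+)?")


def check_sentence(sentence: str, min_families: int = 2) -> str:
    """Classify a sentence as 'no-number', 'keep', or 'downgrade'."""
    text = CITATION.sub("", sentence)
    if not NUMBER.search(text):
        return "no-number"
    lowered = text.lower()
    present = [
        name for name, words in FAMILIES.items()
        if any(word in lowered for word in words)
    ]
    return "keep" if len(present) >= min_families else "downgrade"
```

The threshold of two families mirrors the "at least 2 of the following" rule. Substring matching is deliberately crude, so a "keep" verdict only means the sentence is a candidate for keeping its number.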

Mini examples (paraphrase; do not copy)

Bad (underspecified):

  • Model X achieves 75% exact performance [@SomeBench].

Better (minimal context):

  • On <task/benchmark>, Model X reaches ~75% <metric>, under <constraint/budget/tool access> [@SomeBench].

Better (downgrade when context is missing):

  • Reported gains vary, but comparisons remain fragile when budgets and retry policies are not reported [@SomeBench].

Done checklist

  • output/EVAL_ANCHOR_REPORT.md exists and reports a non-zero file count.
  • No numeric claim remains without minimal protocol context.
  • No ambiguous model naming remains unless explicitly supported by citations.
  • Citation keys are unchanged.
  • If you removed/downgraded numbers, the paragraph still makes a defensible, evidence-bounded point.
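
The "citation keys are unchanged" item can be checked mechanically by comparing the keys in each paragraph before and after the sweep. This is a sketch under the assumption of Pandoc-style `[@Key]` citations and blank-line paragraph breaks; `citations_unchanged` is a hypothetical helper.

```python
import re

# Pandoc-style citation keys (assumed syntax, as in the mini examples).
CITE_KEY = re.compile(r"@([A-Za-z0-9_][A-Za-z0-9_:-]*)")


def keys_by_paragraph(text: str) -> list[list[str]]:
    """Ordered citation keys for each blank-line-separated paragraph."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [CITE_KEY.findall(paragraph) for paragraph in paragraphs]


def citations_unchanged(before: str, after: str) -> bool:
    """True only if no key was added, removed, or moved across paragraphs."""
    return keys_by_paragraph(before) == keys_by_paragraph(after)
```

The comparison is strict on purpose: it also fails when keys are reordered inside a paragraph, which errs on the side of flagging a rewrite for manual review.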

Script

Quick Start

  • python .codex/skills/evaluation-anchor-checker/scripts/run.py --workspace workspaces/<ws>
