literature-engineer

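Multi-route literature expansion + metadata normalization for evidence-first surveys. Produces a large candidate pool (`papers/papers_raw.jsonl`, target ≥1200) with stable IDs and provenance, ready for dedupe/rank + citation generation.
**Trigger**: evidence collector, literature engineer, 文献扩充 (literature expansion), 多路召回 (multi-route recall), snowballing, cited by, references, 元信息增强 (metadata enrichment), provenance.
**Use when**: you need to expand the candidate literature to ≥1200 papers and fill in traceable metadata (Stage C1 of the survey pipeline; the evidence prerequisite for writing).
**Skip if**: you already have a high-quality `papers/papers_raw.jsonl` (≥1200 records, each with a stable identifier and a provenance record).
**Network**: can run offline (using imports); snowballing and online retrieval require network access.
**Guardrail**: never fabricate papers; every record must carry a stable identifier (arXiv ID / DOI / trusted URL) and provenance; do not write prose under output/.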

Install this agent skill to your Project

npx add-skill https://github.com/WILLOSCAR/research-units-pipeline-skills/tree/main/.codex/skills/literature-engineer

SKILL.md

Literature Engineer (evidence collector)

Goal: build a large, verifiable candidate pool for downstream dedupe/rank, mapping, notes, citations, and drafting.

This skill is intentionally evidence-first: if you can't reach the target size with verifiable IDs/provenance, the correct behavior is to block and ask for more exports or for network access, not to fabricate records.

Load Order

Always read:

references/domain_pack_overview.md — how domain packs drive topic-specific behavior

Domain packs (loaded by topic match):

assets/domain_packs/llm_agents.json — pinned classic/survey arXiv IDs for LLM agent topics

Script Boundary

Use scripts/run.py only for:

multi-route offline import, normalization, and provenance tagging
online arXiv/Semantic Scholar API retrieval
snowball expansion and deduplication
retrieval report generation

Do not treat run.py as the place for:

hardcoded pinned arXiv ID lists (use domain packs)
hardcoded topic detection logic (use domain packs)

Inputs

queries.md
- keywords, exclude, max_results, time window
Optional offline sources (any combination; all are merged):
- papers/import.(csv|json|jsonl|bib)
- papers/arxiv_export.(csv|json|jsonl|bib)
- papers/imports/*.(csv|json|jsonl|bib)
Optional snowball exports (offline):
- papers/snowball/*.(csv|json|jsonl|bib)
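
The input patterns above can be expanded into a concrete file list with a few globs. The following is a minimal sketch, assuming a workspace root that contains a `papers/` directory; `discover_offline_inputs` is a hypothetical helper for illustration, and scripts/run.py may resolve its inputs differently.

```python
from pathlib import Path

# File extensions listed under Inputs.
EXTENSIONS = ("csv", "json", "jsonl", "bib")


def discover_offline_inputs(workspace):
    """Return (import_files, snowball_files) found under a workspace.

    Hypothetical helper: it only mirrors the path patterns named in Inputs.
    """
    papers = Path(workspace) / "papers"
    imports = []
    for ext in EXTENSIONS:
        # Single-file routes: papers/import.<ext> and papers/arxiv_export.<ext>
        for name in ("import", "arxiv_export"):
            candidate = papers / f"{name}.{ext}"
            if candidate.is_file():
                imports.append(candidate)
        # Directory route: papers/imports/*.<ext>
        imports.extend(sorted((papers / "imports").glob(f"*.{ext}")))
    snowball = []
    for ext in EXTENSIONS:
        # Optional snowball exports: papers/snowball/*.<ext>
        snowball.extend(sorted((papers / "snowball").glob(f"*.{ext}")))
    return imports, snowball
```

Files with other extensions are ignored, and a missing `papers/imports/` or `papers/snowball/` directory simply contributes nothing.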

Outputs

papers/papers_raw.jsonl
- 1 record per line; minimum fields:
  - title (str), authors (list[str]), year (int|""), url (str)
  - stable identifier(s): arxiv_id and/or doi
  - abstract (str; may be empty in offline mode)
  - source (str) + provenance (list[dict])
papers/papers_raw.csv (human scan)
papers/retrieval_report.md (route counts, missing-meta stats, next actions)
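
To make the minimum fields concrete, here is a sketch of one record serialized as a JSONL line. All values are placeholders (not a real paper), and the `source` value and the keys inside each provenance entry (`route`, `file`) are assumptions; the field list above only says that provenance is a list of dicts.

```python
import json

# Placeholder record for illustration only; it does not describe a real paper.
record = {
    "title": "Example Paper Title",
    "authors": ["A. Author", "B. Author"],
    "year": 2023,  # int, or "" when the year is unknown
    "url": "https://arxiv.org/abs/0000.00000",
    "arxiv_id": "0000.00000",
    "doi": "",
    "abstract": "",  # may be empty in offline mode
    "source": "offline_import",  # assumed label
    "provenance": [{"route": "offline_import", "file": "papers/imports/a.bib"}],
}

REQUIRED = ("title", "authors", "year", "url", "abstract", "source", "provenance")


def missing_fields(rec):
    """List required keys that are absent, plus a marker if no stable ID is set."""
    missing = [key for key in REQUIRED if key not in rec]
    if not (rec.get("arxiv_id") or rec.get("doi")):
        missing.append("arxiv_id|doi")
    return missing


# One record per line, as papers/papers_raw.jsonl expects.
line = json.dumps(record, ensure_ascii=False)
```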

Workflow (multi-route)

Offline-first merge: ingest all available offline exports (and label provenance per file).
Online retrieval (optional): if enabled, run arXiv API retrieval for each keyword query.
Snowballing (optional): expand from seed papers via references/cited-by (online), or merge offline snowball exports.
Normalize + dedupe: canonicalize IDs/URLs, merge duplicates while unioning provenance.
Report: write a concise retrieval report with coverage buckets and missing-meta counts.
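
The normalize + dedupe step can be sketched as follows. This is an illustrative approach under stated assumptions, not the logic in scripts/run.py: records are keyed by a version-stripped arXiv ID or a lowercased DOI, duplicates are merged by unioning their provenance lists, and empty fields are backfilled from later duplicates. `canonical_key` and `merge_records` are hypothetical names.

```python
import re


def canonical_key(rec):
    """Return a dedupe key (arXiv ID without version, else lowercased DOI), or None."""
    arxiv_id = (rec.get("arxiv_id") or "").strip()
    if arxiv_id:
        # Drop an "arXiv:" prefix and a trailing version suffix such as "v2".
        arxiv_id = re.sub(r"^arxiv:", "", arxiv_id, flags=re.IGNORECASE)
        arxiv_id = re.sub(r"v\d+$", "", arxiv_id)
        return "arxiv:" + arxiv_id
    doi = (rec.get("doi") or "").strip().lower()
    if doi:
        # Accept DOIs given as resolver URLs.
        doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)
        return "doi:" + doi
    return None


def merge_records(records):
    """Merge duplicates by canonical key, unioning provenance entries."""
    merged = {}
    unkeyed = []  # records without a stable ID are passed through unmerged
    for rec in records:
        key = canonical_key(rec)
        if key is None:
            unkeyed.append(rec)
            continue
        if key not in merged:
            merged[key] = dict(rec, provenance=list(rec.get("provenance", [])))
            continue
        kept = merged[key]
        for entry in rec.get("provenance", []):
            if entry not in kept["provenance"]:
                kept["provenance"].append(entry)
        # Backfill fields that are empty in the record seen first.
        for field, value in rec.items():
            if field != "provenance" and not kept.get(field) and value:
                kept[field] = value
    return list(merged.values()) + unkeyed
```

Keeping the first-seen record and only filling its gaps is one possible merge policy; a real pipeline might instead prefer the route with the richest metadata.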

Quality checklist

Candidate pool size target met (A150++: ≥1200) without fabrication.
Each record has a stable identifier (arxiv_id or doi, plus url).
Each record has provenance: which route/file/API produced it.
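
The checklist can be turned into a few counts of the kind the retrieval report summarizes. This is a sketch only: `checklist_stats` is a hypothetical helper, and the `route` key inside provenance entries is an assumption about the record layout.

```python
from collections import Counter

TARGET_SIZE = 1200  # candidate pool target from the checklist


def checklist_stats(records, target=TARGET_SIZE):
    """Summarize pool size, missing-ID and missing-provenance counts, and route counts."""
    missing_id = sum(
        1 for rec in records if not (rec.get("arxiv_id") or rec.get("doi"))
    )
    missing_provenance = sum(1 for rec in records if not rec.get("provenance"))
    # Count how many records each route contributed to (assumed "route" key).
    routes = Counter(
        entry.get("route", "unknown")
        for rec in records
        for entry in rec.get("provenance", [])
    )
    return {
        "total": len(records),
        "target_met": len(records) >= target,
        "missing_stable_id": missing_id,
        "missing_provenance": missing_provenance,
        "route_counts": dict(routes),
    }
```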

Script

Quick Start

python .codex/skills/literature-engineer/scripts/run.py --help

All Options

See python .codex/skills/literature-engineer/scripts/run.py --help.
Reads retrieval config from queries.md.
Offline inputs (merged if present): papers/import.(csv|json|jsonl|bib), papers/arxiv_export.(csv|json|jsonl|bib), papers/imports/*.(csv|json|jsonl|bib).
Optional offline snowball inputs: papers/snowball/*.(csv|json|jsonl|bib).
Online expansion requires network: use --online and/or --snowball.
Online retrieval is best-effort: arXiv API can be flaky in some environments; the script will also attempt a Semantic Scholar route when needed.
For LLM-agent topics, the script also performs a best-effort pinned arXiv id_list fetch (canonical classics like ReAct/Toolformer/Reflexion/Voyager/Tree-of-Thoughts + a small prior-survey seed set) so ref.bib can include must-cite anchors even when keyword search misses them.
If HTTPS/TLS to external domains is unstable, the Semantic Scholar route is fetched via the r.jina.ai proxy so the pipeline can still self-boot without manual exports.
When an online run returns 0 records due to transient network errors, a simple rerun is often sufficient (the pipeline should not fabricate).
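
For orientation, the two kinds of arXiv request mentioned above (keyword search and a pinned `id_list` fetch) can be expressed as query URLs against arXiv's public API. This is a sketch that only builds URLs; whether scripts/run.py constructs its requests this way is an assumption, the helper names are hypothetical, and the two IDs in the usage note are illustrative (ReAct and Toolformer), not the contents of the domain pack.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def keyword_query_url(keyword, max_results=100, start=0):
    """Build a keyword search URL using arXiv's search_query parameter."""
    params = {
        "search_query": f'all:"{keyword}"',
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urlencode(params)


def id_list_url(arxiv_ids):
    """Build a fetch-by-ID URL using arXiv's id_list parameter."""
    params = {"id_list": ",".join(arxiv_ids), "max_results": len(arxiv_ids)}
    return ARXIV_API + "?" + urlencode(params)
```

For example, `id_list_url(["2210.03629", "2302.04761"])` requests two specific papers regardless of whether a keyword search would have surfaced them.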

Examples

Offline imports only:
- Put exports under papers/imports/ then run:
  - python .codex/skills/literature-engineer/scripts/run.py --workspace <ws>
Explicit offline inputs (multi-route):
- python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --input path/to/a.bib --input path/to/b.jsonl
Online arXiv retrieval (needs network):
- python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --online
Snowballing (needs network unless you provide offline snowball exports):
- python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --snowball

Troubleshooting

Issue: can't reach ≥1200 papers

Symptom:

papers/papers_raw.jsonl is far below the target size; later stages will fail on mapping/bindings and citation density.

Causes:

Only a small offline export was provided.
Network is blocked so online retrieval/snowballing can't run.

Solutions:

Provide additional exports under papers/imports/ (multiple routes/queries).
Provide snowball exports under papers/snowball/.
Enable network and rerun with --online --snowball.

Issue: many records missing stable IDs

Symptom:

Report shows many entries with empty arxiv_id and doi.

Solutions:

Prefer arXiv/OpenReview/ACL exports that include stable IDs.
If you have network, rerun with --online to backfill arXiv IDs.
Filter out ID-less entries before downstream citation generation.
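
The last solution can be sketched as a small pre-filter over the JSONL pool. This assumes the record layout described under Outputs; `load_jsonl` and `split_by_stable_id` are hypothetical helpers, not part of scripts/run.py.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON record per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def split_by_stable_id(records):
    """Partition records into (with_id, without_id) based on arxiv_id / doi."""
    with_id, without_id = [], []
    for rec in records:
        has_id = bool(
            (rec.get("arxiv_id") or "").strip() or (rec.get("doi") or "").strip()
        )
        (with_id if has_id else without_id).append(rec)
    return with_id, without_id
```

Keeping the ID-less records in a separate list, rather than discarding them, leaves room to backfill their identifiers later.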

Maintainer

WILLOSCAR Core maintainer

Source details

Full Name: WILLOSCAR/research-units-pipeline-skills
Branch: main
Path in repo: .codex/skills/literature-engineer
Topics: claude-code claude skills codex gpt pipeline research research-paper research-project research-tool tools units vibe vibe-coding vibecoding
