Agent skill
pdf-reading
Read local PDFs to extract and verify exact numbers (counts, percentages, tables, figure captions) for papers/questions in this repository. Use this when asked to “read a PDF”, “extract results from the paper”, “verify a statistic”, or “find the exact wording in the paper”.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/pdf-reading
SKILL.md
Goal
When you need facts from a paper PDF (counts, percentages, benchmark numbers, claims, limitations), extract verbatim evidence from the PDF and compute derived values yourself.
This repository’s content often depends on exact values from tables/figures (not abstracts). Always bias toward precision and traceability.
Process
- Locate the PDF
  - Search the repo for `.pdf` files.
  - If a paper directory contains a source PDF, prefer that.
  - If the only PDF is in `tmp/` or the repo root, confirm it corresponds to the paper in question before using it.
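The lookup order above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's own tooling: the function name `find_pdfs` and the rule "anything in the repo root or under `tmp/` needs confirmation" are assumptions drawn from the bullets above.

```python
from pathlib import Path


def find_pdfs(repo_root: str) -> list[Path]:
    """Return PDFs under repo_root, with paper-directory files listed first."""
    root = Path(repo_root)
    pdfs = [p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"]

    def needs_confirmation(pdf: Path) -> bool:
        rel = pdf.relative_to(root)
        # A PDF sitting directly in the repo root, or anywhere under tmp/,
        # should be checked against the paper in question before it is trusted.
        return len(rel.parts) == 1 or rel.parts[0] == "tmp"

    # False sorts before True, so PDFs inside paper directories come first.
    return sorted(pdfs, key=lambda p: (needs_confirmation(p), str(p)))
```

Candidates at the end of the returned list are the ones to confirm manually before use.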
- Extract text locally (no network fetches)
  - Prefer a local text extraction flow:
    - Use `.github/skills/pdf-reading/extract_pdf_text.py` to create a plain-text copy in `tmp/`.
    - If extraction fails, try a different backend (`pypdf` vs `pdftotext`) or fall back to manual inspection.
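The backend fallback described above might look roughly like the following. This is a sketch under stated assumptions, not the contents of `extract_pdf_text.py`: the function names are hypothetical, `pypdf` is assumed to be importable, and the `pdftotext` command-line tool is assumed to be on `PATH`. A backend that is missing or fails is skipped.

```python
import subprocess
from typing import Callable, Optional, Sequence


def extract_with_pypdf(pdf_path: str) -> str:
    # Imported lazily so a missing pypdf install only disables this backend.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_pdftotext(pdf_path: str) -> str:
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def extract_text(
    pdf_path: str,
    backends: Sequence[Callable[[str], str]] = (extract_with_pypdf, extract_with_pdftotext),
) -> Optional[str]:
    """Try each backend in order; return None if none yields usable text."""
    for backend in backends:
        try:
            text = backend(pdf_path)
        except Exception:
            # Missing dependency, missing binary, or a parse error: try the next backend.
            continue
        if text.strip():
            return text
    # None signals that manual inspection is needed (for example, a scanned PDF).
    return None
```

Treating empty output the same as a failure means a backend that returns only whitespace does not mask a working one later in the list.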
- Search within the extracted text
  - Use targeted queries first (unique phrases, table titles, “Table 2”, “Appendix”, metric names).
  - For numbers, search patterns like `n=`, `N=`, `(`, `%)`, `Table`, `Figure`.
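As an illustration of the number-oriented search, the sketch below scans extracted text for a few common patterns and returns each hit with surrounding lines. The pattern list and the name `find_numbers` are assumptions chosen for this example; adjust them to the paper at hand.

```python
import re

# Hypothetical starter patterns; extend as needed for a given paper.
NUMERIC_PATTERNS = [
    r"\b[nN]\s*=\s*\d+",          # sample sizes such as n=50 or N = 120
    r"\d+(?:\.\d+)?\s*%",         # percentages such as 62% or 62.0 %
    r"\b\d+\s*/\s*\d+\b",         # raw counts such as 31/50
    r"\b(?:Table|Figure)\s+\d+",  # table and figure labels
]
COMBINED = re.compile("|".join(f"(?:{p})" for p in NUMERIC_PATTERNS))


def find_numbers(text: str, context: int = 1) -> list[tuple[int, str]]:
    """Return (1-based line number, snippet) for every line matching a pattern."""
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if COMBINED.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            hits.append((i + 1, "\n".join(lines[start:end])))
    return hits
```

Keeping a line or two of context with each hit makes it easier to tell which table or ablation a number belongs to.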
- Verify statistics (repo requirement)
  - Prefer raw counts (e.g., “31/50”) over percentages when available.
  - If the paper gives counts, compute percentages yourself: $\text{pct} = 100 \times \frac{\text{numerator}}{\text{denominator}}$.
  - If a value is ambiguous (multiple similar tables/ablations), capture the surrounding label/context.
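A small helper can apply the formula above and compare the result against a percentage the paper reports. This is a minimal sketch: the function names and the default tolerance (0.05, which matches a value rounded to one decimal place) are assumptions for illustration.

```python
import re

COUNT_RE = re.compile(r"(\d+)\s*/\s*(\d+)")


def pct_from_counts(numerator: int, denominator: int) -> float:
    """pct = 100 * numerator / denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100 * numerator / denominator


def check_reported_pct(snippet: str, reported_pct: float, tolerance: float = 0.05):
    """Recompute a percentage from the first 'a/b' count in snippet.

    Returns (computed_pct, matches_reported).
    """
    match = COUNT_RE.search(snippet)
    if match is None:
        raise ValueError("no raw count of the form a/b found")
    computed = pct_from_counts(int(match.group(1)), int(match.group(2)))
    return computed, abs(computed - reported_pct) <= tolerance


print(check_reported_pct("solved 31/50 tasks", 62.0))  # (62.0, True)
```

If the paper rounds to whole percentages, pass `tolerance=0.5` instead.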
- Handle common PDF pitfalls
  - Hyphenation and line breaks: words may be split across lines; search both with and without hyphens.
  - Tables: extracted text may be messy; search by row/column headers and unique tokens.
  - Scanned PDFs: text extraction may fail; use manual reading if needed.
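The hyphenation and scanned-PDF pitfalls can be handled with a couple of text-level helpers. The sketch below is illustrative only: all function names are hypothetical, and the scanned-page heuristic (more than half the pages have fewer than 20 characters) is an assumed threshold, not a rule from the repository.

```python
import re


def dehyphenate(text: str) -> str:
    # Join words split across lines: "bench-\nmark" becomes "benchmark".
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def join_keeping_hyphen(text: str) -> str:
    # Keep genuine hyphens: "zero-\nshot" becomes "zero-shot".
    return re.sub(r"-\s*\n\s*", "-", text)


def normalize_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def contains_phrase(text: str, phrase: str) -> bool:
    """Search the raw text plus both hyphen-handling variants, ignoring case."""
    needle = normalize_whitespace(phrase).lower()
    variants = (text, dehyphenate(text), join_keeping_hyphen(text))
    return any(needle in normalize_whitespace(v).lower() for v in variants)


def looks_scanned(page_texts: list[str], min_chars: int = 20) -> bool:
    """Heuristic: treat the PDF as scanned if most pages have almost no text."""
    if not page_texts:
        return True
    sparse = sum(1 for t in page_texts if len(t.strip()) < min_chars)
    return sparse / len(page_texts) > 0.5
```

Searching both variants matters because the extracted text alone cannot tell a line-break hyphen from a real one.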
Output expectations
- When updating a question/paper, report the exact extracted phrase/value and where it came from (section/table/figure name).
- If you cannot reliably extract the needed value, explicitly say so and propose next steps (e.g., manual verification).
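One way to keep reports traceable is to carry each extracted value together with its source location. The record type below is a hypothetical sketch, not an existing convention in this repository; the field names and the rendered format are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    quote: str                     # exact phrase or value as extracted from the PDF
    location: str                  # section, table, or figure name
    derived: Optional[str] = None  # any value computed from the quote

    def render(self) -> str:
        line = f'"{self.quote}" ({self.location})'
        if self.derived:
            line += f"; derived: {self.derived}"
        return line


print(Evidence("31/50", "Table 2", "62.0%").render())
# "31/50" (Table 2); derived: 62.0%
```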
Commands
- Extract text:
  `python3 .github/skills/pdf-reading/extract_pdf_text.py path/to/paper.pdf`
- Extract to a specific file:
  `python3 .github/skills/pdf-reading/extract_pdf_text.py path/to/paper.pdf --out tmp/paper.txt`
Repository conventions to respect
- Keep diffs minimal and consistent with existing patterns.
- Park derived artifacts under `tmp/` (gitignored).
- Don’t add new dependencies unless explicitly requested; prefer optional tooling or clear fallbacks.