Agent skill
pdf-metadata-extractor
Extracts specific metadata from PDF documents, particularly academic papers, including author information, affiliations, and institutional details. This skill is triggered by requests involving PDF analysis, academic paper processing, or when precise information extraction from document headers/footers is required. It handles PDF parsing, text extraction from specific pages, and structured data retrieval from academic document formats.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/pdf-metadata-extractor
SKILL.md
Instructions for PDF Metadata Extraction
Primary Use Case
This skill is designed to systematically extract first author metadata from academic papers in PDF format. The core workflow involves:
- Locating and reading an Excel file containing a list of paper titles.
- Searching for and downloading the corresponding PDFs.
- Parsing the first pages of the PDFs to extract the first author's full name and their complete institutional affiliation.
- Searching for the author's Google Scholar profile.
- Writing all extracted data back to the original Excel file.
Core Workflow
1. Locate and Inspect the Target Excel File
- Use the `terminal-run_command` tool to find `.xlsx` or `.xls` files within the `/workspace/dumps/workspace` directory.
- Security Note: Do not use file paths outside the allowed directory (e.g., `/dev/null`).
- Use `excel-get_workbook_metadata` to inspect the file's structure (sheet names, used ranges).
- Use `excel-read_data_from_excel` to read the paper titles from the first column (typically starting at cell A2).
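The file search in this step can be pictured with a small Python sketch. It is only an illustration of the logic, not what `terminal-run_command` actually runs; the helper name `find_excel_files` and the choice to skip Excel lock files are assumptions.

```python
from pathlib import Path

# Directory named in the skill; everything outside it is off limits.
ALLOWED_ROOT = Path("/workspace/dumps/workspace")


def find_excel_files(root: Path = ALLOWED_ROOT) -> list[Path]:
    """Return .xlsx/.xls files under root (hypothetical helper)."""
    root = root.resolve()
    matches = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if path.suffix.lower() not in {".xlsx", ".xls"}:
            continue
        # Assumption: "~$name.xlsx" lock files left by Excel are not real inputs.
        if path.name.startswith("~$"):
            continue
        # Skip anything that resolves outside the allowed directory (e.g. a symlink).
        if not path.resolve().is_relative_to(root):
            continue
        matches.append(path)
    return sorted(matches)
```

If more than one workbook is returned, it is probably safer to ask the user which one to use than to guess.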
2. Search for and Acquire Paper PDFs
- For each paper title, perform a web search using `local-web_search` to find the PDF. Common sources include:
  - OpenReview (`openreview.net/pdf?id=`)
  - arXiv (`arxiv.org/pdf/`)
  - Conference proceedings (e.g., `proceedings.mlr.press`).
- Prioritize direct PDF links. If a direct link is not found in search results, look for a paper page (e.g., OpenReview forum) and construct the PDF URL from it (often by swapping the page path for `/pdf`, e.g., `openreview.net/forum?id=` becomes `openreview.net/pdf?id=`).
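As a rough sketch of turning a paper page into a PDF link, the function below handles two landing-page patterns: OpenReview `forum?id=` pages and arXiv `/abs/` pages. The helper name `to_pdf_url` is hypothetical, the arXiv identifier in the usage note is a placeholder, and other sites (such as PMLR) would need their own rules.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def to_pdf_url(page_url: str) -> Optional[str]:
    """Map a known paper page to a direct PDF URL, or None if unrecognized."""
    parts = urlsplit(page_url)
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path

    # Already a direct PDF link: return it unchanged.
    if path.lower().endswith(".pdf"):
        return page_url
    if host == "openreview.net" and path == "/pdf":
        return page_url
    if host == "arxiv.org" and path.startswith("/pdf/"):
        return page_url

    # OpenReview forum page: same query string, path swapped to /pdf.
    if host == "openreview.net" and path == "/forum":
        return urlunsplit(parts._replace(path="/pdf", fragment=""))

    # arXiv abstract page: /abs/<id> becomes /pdf/<id>.
    if host == "arxiv.org" and path.startswith("/abs/"):
        paper_id = path[len("/abs/"):]
        return urlunsplit(parts._replace(path="/pdf/" + paper_id, fragment=""))

    return None  # unknown site: fall back to another search
```

For example, `to_pdf_url("https://arxiv.org/abs/1234.56789")` yields `https://arxiv.org/pdf/1234.56789`, while an unrecognized page returns `None` so the caller can try a different search result.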
3. Extract Metadata from PDFs
- Use `pdf-tools-read_pdf_pages` to read the first 1-2 pages of each acquired PDF. This is where author and affiliation information is almost always located.
- Parsing Strategy:
  - First Author Name: Identify the first listed author in the author block, typically following the title and preceding the abstract. Extract their full name as it appears (e.g., "Aaditya K. Singh", "Amber Yijia Zheng*").
  - Affiliation: Extract the complete affiliation string associated with the first author. This often includes the department, university/institution, and sometimes city/country. Capture it exactly as printed, including all listed institutions if multiple are present (e.g., "Gatsby Computational Neuroscience Unit, University College London", "ENSAE, CREST, IP Paris").
4. Find Google Scholar Profiles
- For each extracted first author, perform a targeted web search using `local-web_search` with the query format: `"<First Author Full Name>" Google Scholar profile`.
- Extract the direct URL to their Google Scholar citations page from the search results (e.g., `https://scholar.google.com/citations?user=...`).
- If the primary search fails, try a more specific query including their institution (e.g., `"Aaditya K. Singh" University College London Google Scholar`).
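The query format and URL extraction above might be sketched as follows. The output format of `local-web_search` is not documented here, so this sketch assumes each result can be treated as a plain string; the helper names and the regular expression are illustrative assumptions.

```python
import re
from typing import Iterable, Optional

# Matches Google Scholar citation-profile URLs that carry a user= parameter.
SCHOLAR_URL = re.compile(
    r"https?://scholar\.google\.[a-z.]+/citations\?"
    r"[^\s\"'<>]*\buser=[\w-]+[^\s\"'<>]*"
)


def scholar_queries(author: str, institution: str = "") -> list[str]:
    """Primary query first, then the institution-specific fallback."""
    queries = [f'"{author}" Google Scholar profile']
    if institution:
        queries.append(f'"{author}" {institution} Google Scholar')
    return queries


def first_scholar_url(result_texts: Iterable[str]) -> Optional[str]:
    """Return the first profile URL found in the result strings, if any."""
    for text in result_texts:
        match = SCHOLAR_URL.search(text)
        if match:
            # Drop punctuation that often trails a URL in running text.
            return match.group(0).rstrip(".,;)")
    return None
```

A match only shows that a profile URL appeared in the results; checking that the profile's name and institution agree with the extracted author is still worthwhile, since common names can collide.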
5. Compile and Write Results
- Structure the extracted data (First Author, Affiliation, Google Scholar URL) into a list of lists, maintaining the same order as the input paper titles.
- Use `excel-write_data_to_excel` to write this data back to the original Excel file. The data should start in column B, row 2 (next to the paper titles).
- Use `excel-read_data_from_excel` to verify the data was written correctly.
- Finally, present a summary table to the user and call `local-claim_done`.
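One way to picture the "list of lists" from this step is the sketch below. The helper names `build_rows` and `target_range` are hypothetical, and the parameters that `excel-write_data_to_excel` actually accepts are not specified here; the range string is only a sanity check on where the block should land.

```python
def build_rows(titles, extracted):
    """Build one [author, affiliation, scholar_url] row per title, in sheet order.

    titles: paper titles as read from column A.
    extracted: dict mapping title -> (author, affiliation, scholar_url).
    Papers with no result get an empty row so later rows stay aligned.
    """
    rows = []
    for title in titles:
        author, affiliation, url = extracted.get(title, ("", "", ""))
        rows.append([author or "", affiliation or "", url or ""])
    return rows


def target_range(n_rows, start_col="B", start_row=2, n_cols=3):
    """A1-style range covered by the rows (single-letter columns only)."""
    if n_rows < 1:
        raise ValueError("nothing to write")
    end_col = chr(ord(start_col) + n_cols - 1)
    end_row = start_row + n_rows - 1
    return f"{start_col}{start_row}:{end_col}{end_row}"
```

Keeping blank rows for missing papers, rather than dropping them, is what preserves the one-to-one alignment with the titles in column A.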
Error Handling & Assumptions
- Missing PDFs: If a PDF cannot be found after a reasonable search, note this and proceed with the next paper. Inform the user.
- Unclear Author Block: If the first page parsing does not yield clear author/affiliation data, consider checking the second page or the very end of the PDF (footer).
- Multiple "First Authors": Some papers denote joint first authorship with asterisks (*). Extract the first author in the list who is marked as such, or the very first author if no markers are present.
- Affiliation Formatting: Preserve line breaks, commas, and institutional hierarchies as they appear in the PDF. Do not simplify.
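The joint first authorship rule above can be expressed as a short sketch. It assumes the author block has already been split into a list of names and that the marker is a trailing character such as `*`; the helper name `pick_first_author` is hypothetical.

```python
def pick_first_author(authors: list[str], marker: str = "*") -> str:
    """Return the first marked author, or the first author if none is marked.

    The name is returned as printed, marker included, to match the
    "extract as it appears" rule.
    """
    if not authors:
        raise ValueError("author list is empty")
    for name in authors:
        if name.rstrip().endswith(marker):
            return name.strip()
    return authors[0].strip()
```

Papers that use a different symbol for equal contribution (for example a dagger) can pass it as `marker`; the footnote on the first page usually says which symbol is in use.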
Key Tools Used
- `terminal-run_command`: File system navigation.
- `excel-get_workbook_metadata` / `excel-read_data_from_excel` / `excel-write_data_to_excel`: Excel I/O.
- `local-web_search`: Finding papers and scholar profiles.
- `pdf-tools-read_pdf_pages`: Core PDF text extraction.
- `local-claim_done`: Signal task completion.