research-and-literature-gathering
Systematic workflow for finding, downloading, and indexing engineering literature by domain. Covers the full lifecycle: discovery via standards ledger and doc index, web search for open-access PDFs, download script generation, PDF validation, catalogue YAML creation, and handoff to the 7-phase document-index-pipeline for indexing. Use when populating a new engineering domain with reference literature or when a WRK item requires domain-specific standards and textbooks.
Research & Literature Gathering for Engineering Domains
Full lifecycle: Discover → Search → Download → Validate → Catalogue → Index
When to Use This Skill
Trigger this skill when:
- A WRK item requires populating literature for an engineering domain
- A new domain is being set up and needs reference material
- An existing domain has gaps identified by Phase F of the document-index-pipeline
- A calculation implementation needs standards/textbooks not yet downloaded
- You see keywords like "gather literature", "download standards", "populate domain"
Engineering Domains
Active Domains (with existing literature directories)
| Domain | Literature Path | Key Standards Bodies |
|---|---|---|
| cathodic_protection | /mnt/ace-data/digitalmodel/docs/domains/cathodic_protection/literature/ | DNV, NACE, ISO |
| geotechnical | /mnt/ace-data/digitalmodel/docs/domains/geotechnical/literature/ | API, DNV, ISO |
| hydrodynamics | /mnt/ace-data/digitalmodel/docs/domains/hydrodynamics/literature/ | DNV, ITTC, SNAME |
| naval_architecture | /mnt/ace-data/digitalmodel/docs/domains/naval_architecture/literature/ | ABS, DNV, SNAME, IMO |
| pipeline | /mnt/ace-data/digitalmodel/docs/domains/pipeline/literature/ | DNV, API, ASME, BSEE |
| structural | /mnt/ace-data/digitalmodel/docs/domains/structural/literature/ | AISC, DNV, IIW, API |
| structural-parachute | /mnt/ace-data/digitalmodel/docs/domains/structural-parachute/literature/ | NHRA, SFI, NASA, AISC |
| subsea | /mnt/ace-data/digitalmodel/docs/domains/subsea/literature/ | API, DNV, BSEE |
| metocean | /mnt/ace-data/digitalmodel/docs/domains/metocean/literature/ | DNV, API, ISO, WMO |
Additional Domains (in standards but not yet in literature tree)
| Domain | Typical Target Repo |
|---|---|
| catenary | digitalmodel |
| mooring | digitalmodel |
| risers | digitalmodel |
| drilling | OGManufacturing |
| bsee | worldenergydata |
| economics | worldenergydata |
Legacy Standards Storage
The og_standards corpus lives at /mnt/ace/docs/_standards/ organized by org:
ABS, API, ASTM, BSI, DNV, ISO, MIL, NEMA, Norsok, OnePetro, Unknown.
Inventory DB: /mnt/ace/O&G-Standards/_inventory.db (SQLite, 6.8 GB).
Step-by-Step Procedure
Step 0 — Verify Domain Directory Exists
DOMAIN="geotechnical" # ← set your domain
LIT_DIR="/mnt/ace-data/digitalmodel/docs/domains/${DOMAIN}/literature"
mkdir -p "${LIT_DIR}"
ls -la "${LIT_DIR}"
If the domain is new, create standard subdirectories:
mkdir -p "${LIT_DIR}"/{textbooks,standards,course-notes,worked-examples}
Step 1 — Query the Standards Ledger
Find standards already tracked for this domain:
uv run --no-project python scripts/data/document-index/query-ledger.py \
--domain ${DOMAIN} --verbose
Record each standard's status: gap, done, wrk_captured, reference.
Standards with gap or wrk_captured are download candidates.
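The candidate selection can be sketched as follows. This is a minimal illustration that assumes the ledger output has been parsed into dictionaries with hypothetical `id` and `status` keys; the actual output format of query-ledger.py is not shown in this skill.

```python
# Statuses that mark a standard as a download candidate (from Step 1).
CANDIDATE_STATUSES = {"gap", "wrk_captured"}


def download_candidates(records):
    """Return the ids of ledger records whose status is gap or wrk_captured.

    `records` is assumed to be a list of dicts with 'id' and 'status' keys;
    this shape is hypothetical and not taken from query-ledger.py.
    """
    return [r["id"] for r in records if r.get("status") in CANDIDATE_STATUSES]


# Illustrative records only; the ids are placeholders, not real standards.
example = [
    {"id": "STD-A", "status": "gap"},
    {"id": "STD-B", "status": "done"},
    {"id": "STD-C", "status": "wrk_captured"},
    {"id": "STD-D", "status": "reference"},
]
print(download_candidates(example))
```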
Step 2 — Query the Document Index
Search the 1M+ record index for existing documents in this domain:
uv run --no-project python -c "
import json
from collections import Counter

# Collect records whose path or summary mentions the domain
matches = []
with open('data/document-index/index.jsonl') as f:
    for line in f:
        rec = json.loads(line)
        path_lower = (rec.get('path') or '').lower()
        summary_lower = (rec.get('summary') or '').lower()
        if '${DOMAIN}' in path_lower or '${DOMAIN}' in summary_lower:
            matches.append(rec)

print(f'Found {len(matches)} documents')
# Tolerate records without a source field instead of raising KeyError
by_source = Counter(r.get('source', 'unknown') for r in matches)
for s, c in by_source.most_common():
    print(f'  {s}: {c}')
"
Prioritize og_standards and ace_standards sources — these are already local.
Step 3 — Cross-Reference Capability Map
Check what calculations exist vs. gaps in the target repo:
uv run --no-project python -c "
import yaml
with open('specs/capability-map/digitalmodel.yaml') as f:
    data = yaml.safe_load(f)
for m in data['modules']:
    if '${DOMAIN}' in m['module'].lower():
        print(f\"Module: {m['module']} ({m.get('standards_count', '?')} standards)\")
        for s in m.get('standards', [])[:30]:
            print(f\"  {s['status']:15s} {s['org']:8s} {s['id'][:70]}\")
"
Step 4 — Web Search for Open-Access Literature
Search for freely available PDFs across these source tiers:
Tier 1 — High-value free sources:
- DNV Veracity (rules.dnv.com/docs/pdf/) — free DNV-RP, DNV-OS PDFs
- API publications (some free after registration)
- DTIC (apps.dtic.mil) — US military/govt technical reports
- NASA Technical Reports (ntrs.nasa.gov)
- BOEM/BSEE (boem.gov, bsee.gov) — regulatory guidance
- University open repos (MIT OCW, TU Delft, deepblue.lib.umich.edu)
Tier 2 — Conference/journal open access:
- OnePetro open-access papers
- ISOPE proceedings (selected)
- OTC open papers
- ResearchGate/Academia.edu author copies
Tier 3 — Textbooks and course notes:
- Internet Archive (archive.org) — public domain texts
- University course note PDFs
- Open textbook initiatives
WAF/paywall notes:
| Site | Issue | Action |
|---|---|---|
| eagle.org (ABS) | Cloudflare WAF blocks wget/curl | Add to pending_manual |
| archive.org borrow | HTTP 403 for borrow-only items | Add to pending_manual |
| IEEE Xplore | Paywalled unless institutional login | Skip or pending_manual |
| ASME Digital Collection | Paywall | Check og_standards DB |
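One way to apply the table before scripting a download is a small host lookup. The sketch below is illustrative only: the `triage` function and `BLOCKED_HOSTS` mapping are hypothetical names, and the `ieeexplore.ieee.org` hostname is an assumption added for the IEEE Xplore row. Borrow-only archive.org items are left out because they must be checked per item rather than per host.

```python
from urllib.parse import urlparse

# Hosts that should go straight to pending_manual (hypothetical mapping,
# based on the WAF/paywall table; ieeexplore.ieee.org is an assumed hostname).
BLOCKED_HOSTS = {
    "eagle.org": "Cloudflare WAF blocks wget/curl",
    "ieeexplore.ieee.org": "Paywalled unless institutional login",
}


def triage(url):
    """Return ('pending_manual', reason) for blocked hosts, else ('download', '')."""
    host = (urlparse(url).hostname or "").lower()
    for blocked, reason in BLOCKED_HOSTS.items():
        # Match the host itself and any subdomain of it.
        if host == blocked or host.endswith("." + blocked):
            return ("pending_manual", reason)
    return ("download", "")


print(triage("https://ww2.eagle.org/some-guide.pdf"))
print(triage("https://ntrs.nasa.gov/some-report.pdf"))
```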
Step 5 — Generate or Update the Download Script
Option A: Use the research-domain.py driver (queries all data sources, generates brief + script):
uv run --no-project python scripts/data/research-literature/research-domain.py \
--category ${DOMAIN} --repo digitalmodel --generate-download-script
Option B: Manual script creation from template:
#!/usr/bin/env bash
# ABOUTME: Download open-access ${DOMAIN} literature
# Usage: bash download-literature.sh [--dry-run]
set -uo pipefail
DEST="/mnt/ace-data/digitalmodel/docs/domains/${DOMAIN}/literature"
LOG_DIR="$(git rev-parse --show-toplevel)/.claude/work-queue/assets"
LOG_FILE="${LOG_DIR}/download-${DOMAIN}.log"
DRY_RUN=false
[[ "${1:-}" == "--dry-run" ]] && DRY_RUN=true
mkdir -p "${DEST}"/{textbooks,standards,course-notes,worked-examples}
mkdir -p "${LOG_DIR}"
# shellcheck source=scripts/lib/download-helpers.sh
source "$(git rev-parse --show-toplevel)/scripts/lib/download-helpers.sh"
log "=== ${DOMAIN} Literature Download ==="
log "Destination: ${DEST}"
log "Dry run: ${DRY_RUN}"
# ─── TEXTBOOKS ────────────────────────────────
log "--- Textbooks ---"
download \
"https://example.org/textbook.pdf" \
"${DEST}/textbooks" \
"Author-Year-Short-Title.pdf"
# ─── STANDARDS ────────────────────────────────
log "--- Standards ---"
download \
"https://rules.dnv.com/docs/pdf/dnvpm/codes/docs/..." \
"${DEST}/standards" \
"DNV-RP-XXXX-Title-Year.pdf" || true
log "=== Download complete ==="
total=$(find "${DEST}" -name "*.pdf" | wc -l)
log " Total PDFs: ${total}"
Save script to: /mnt/ace-data/digitalmodel/docs/domains/${DOMAIN}/literature/download-literature.sh
Key script patterns:
- Always source scripts/lib/download-helpers.sh for the download and log functions
- Use set -uo pipefail (NOT set -e): download failures should log, not abort
- Guard fallible downloads with || true or || log "NOTE: ..."
- Use descriptive filenames: Author-Year-Short-Title.pdf
- The download function auto-skips existing files (resume-safe)
- Always run --dry-run first to preview
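The Author-Year-Short-Title.pdf naming convention can be generated rather than typed by hand. This is a minimal sketch; `literature_filename` is a hypothetical helper, not part of download-helpers.sh, and the slug rule (ASCII letters and digits only) is an assumption.

```python
import re


def literature_filename(author, year, short_title):
    """Build an Author-Year-Short-Title.pdf filename from free-text fields."""

    def slug(text):
        # Keep ASCII letters and digits; collapse everything else to one hyphen.
        return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")

    return f"{slug(author)}-{year}-{slug(short_title)}.pdf"


print(literature_filename("Author Name", 2020, "Short Title"))
```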
Step 6 — Execute Downloads and Validate
# Dry run first
bash download-literature.sh --dry-run
# Execute
bash download-literature.sh
# Validate all PDFs are real PDFs (not HTML/WAF responses)
find "${LIT_DIR}" -name "*.pdf" -exec file {} \; | grep -v "PDF document"
Any file that shows "HTML document" or "ASCII text" instead of "PDF document"
is a WAF response. Move it to a _failed/ directory and add to pending_manual.
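The quarantine step can also be scripted. The sketch below is one possible approach, not the skill's own tooling: it checks for the `%PDF-` signature at the start of each file (stricter than the `file` command, which is an assumption of this example) and moves anything else into `_failed/`. The function name `quarantine_fake_pdfs` is hypothetical.

```python
import shutil
from pathlib import Path


def quarantine_fake_pdfs(lit_dir):
    """Move *.pdf files that do not start with the %PDF- signature to _failed/.

    Returns the original paths of the files that were moved.
    """
    root = Path(lit_dir)
    failed_dir = root / "_failed"
    moved = []
    # sorted() materializes the listing before any file is moved.
    for pdf in sorted(root.rglob("*.pdf")):
        if failed_dir in pdf.parents:
            continue  # already quarantined on an earlier run
        with pdf.open("rb") as fh:
            header = fh.read(5)
        if header != b"%PDF-":
            failed_dir.mkdir(exist_ok=True)
            shutil.move(str(pdf), str(failed_dir / pdf.name))
            moved.append(pdf)
    return moved
```

Each returned path still needs a matching pending_manual entry in the catalogue. Files with the same name in different subdirectories would collide inside `_failed/`, so a real script may want to preserve the relative path.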
Step 7 — Create Catalogue YAML
Write knowledge/seeds/${DOMAIN}-resources.yaml:
category: ${DOMAIN}
subcategory: references
created_at: "YYYY-MM-DD"
textbooks:
  - title: "Full Book Title"
    author: "Author Name"
    year: YYYY
    local_path: "/mnt/ace-data/digitalmodel/docs/domains/${DOMAIN}/literature/textbooks/file.pdf"
    source_url: "https://..."
    size_mb: N
    topics: [topic1, topic2]
standards:
  - title: "Standard Title"
    id: "DNV-RP-XXXX"
    org: "DNV"
    local_path: "/mnt/ace-data/digitalmodel/docs/domains/${DOMAIN}/literature/standards/file.pdf"
    source_url: "https://..."
    key_sections: ["Sec X.Y: relevant"]
online_portals:
  - title: "Portal Name"
    url: "https://..."
    notes: "What is available"
pending_manual:
  - title: "Blocked Resource"
    url: "https://..."
    notes: "WAF blocked / borrow-only / paywalled"
Validate against knowledge/seeds/schema.md.
Step 8 — Produce Research Brief
Save to specs/capability-map/research-briefs/${DOMAIN}.yaml:
uv run --no-project python scripts/data/research-literature/research-domain.py \
--category ${DOMAIN} --repo digitalmodel
Or produce manually following the template in the research-literature skill's
references/templates.md.
Step 9 — Trigger Document Index Pipeline
Hand off to the document-index-pipeline for indexing newly downloaded literature:
# Phase A — re-index to pick up new files
uv run --no-project python scripts/data/document-index/phase-a-index.py
# If full pipeline takes >5 min, queue it instead:
echo "uv run --no-project python scripts/data/document-index/phase-a-index.py" \
> .claude/work-queue/assets/${WRK_ID}/index-regen-queued.txt
The pipeline phases that follow:
- Phase A: scan filesystem → index.jsonl
- Phase B: LLM extraction + classification
- Phase C: domain classification
- Phase E: backpopulate index
See document-index-pipeline skill for full phase details.
Step 10 — Archive Dark Intelligence
For university coursework, worked examples, and methodology extractions:
mkdir -p knowledge/dark-intelligence/${DOMAIN}/
Save worked examples (problem statements + known answers) here for TDD test
generation. These are private resources — see dark-intelligence-workflow skill.
Pitfalls and Warnings
Download Pitfalls
- WAF responses saved as PDF: Always validate with file *.pdf. Sites like eagle.org (ABS) return HTML through Cloudflare WAF. The download helper saves whatever it gets, so you must check.
- set -e kills download scripts: Use set -uo pipefail without -e. The download function returns 1 on failure; with -e the script aborts on the first 404. Guard individual downloads with || true.
- Archive.org borrow-only: Some books show a download button but return 403. Check the item page for "Borrow" vs "Download" before scripting.
- Duplicate downloads across domains: Check if a standard already exists in /mnt/ace/docs/_standards/ or another domain's literature before downloading. Use the og_standards SQLite to search (quote the database path, because the unquoted & would background the command):
  sqlite3 "/mnt/ace/O&G-Standards/_inventory.db" \
    "SELECT path FROM documents WHERE path LIKE '%keyword%' LIMIT 20"
- Large file timeouts: The download helper uses wget --timeout=60. For large textbooks (>50 MB), increase timeout or download manually.
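The duplicate check can also be run from Python with a parameterized query, which avoids shell quoting problems around the `&` in the database path. This is a minimal sketch that assumes the `documents` table with `path` and `is_duplicate` columns seen in this skill's sqlite3 commands; `find_existing_copies` is a hypothetical helper, and the demo uses an in-memory stand-in with placeholder rows rather than the real inventory.

```python
import sqlite3


def find_existing_copies(conn, keyword, limit=20):
    """Return non-duplicate inventory paths that contain `keyword`.

    Assumes a `documents` table with `path` and `is_duplicate` columns.
    """
    rows = conn.execute(
        "SELECT path FROM documents "
        "WHERE path LIKE ? AND is_duplicate = 0 LIMIT ?",
        (f"%{keyword}%", limit),
    )
    return [row[0] for row in rows]


# In-memory stand-in for the inventory; rows are placeholders only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE documents (path TEXT, is_duplicate INTEGER)")
demo.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("/org-a/standard-x.pdf", 0), ("/copies/standard-x.pdf", 1)],
)
print(find_existing_copies(demo, "standard-x"))
```

Against the real inventory, the connection would be opened with `sqlite3.connect("/mnt/ace/O&G-Standards/_inventory.db")`.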
Indexing Pitfalls
- pdfplumber hangs on NTFS/NFS: For batch PDF processing, use pdftotext via subprocess, NOT pdfplumber. See WRK-1277 warning in document-index-pipeline.
- Phase A must run before Phase B: New files won't be extracted/classified until Phase A adds them to index.jsonl.
- Budget awareness for Phase B: LLM extraction costs ~$0.002/doc (Haiku). Large domain additions (100+ docs) should be batched and budget-tracked.
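The Phase B budget can be estimated before a run with simple arithmetic. The sketch below uses the approximate $0.002/doc figure quoted above; the `plan_batches` helper and its `batch_size=100` default are illustrative assumptions, not pipeline settings.

```python
COST_PER_DOC_USD = 0.002  # approximate per-document figure quoted above


def plan_batches(n_docs, batch_size=100, cost_per_doc=COST_PER_DOC_USD):
    """Split n_docs into batches and estimate the cost of each batch.

    Returns a list of (batch_doc_count, estimated_cost_usd) tuples.
    """
    batches = []
    remaining = n_docs
    while remaining > 0:
        count = min(batch_size, remaining)
        batches.append((count, round(count * cost_per_doc, 4)))
        remaining -= count
    return batches


plan = plan_batches(250)
for count, cost in plan:
    print(f"{count} docs: ~${cost:.2f}")
print(f"Total: ~${sum(cost for _, cost in plan):.2f}")
```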
Catalogue Pitfalls
- Every attempted resource must appear: Never silently skip a failed download. It goes in pending_manual: with a reason (WAF, paywall, borrow-only, 404).
- Validate local_path in YAML: Every local_path in the catalogue YAML must point to an actually existing file. An entry with no local_path is reported as MISSING. Script this check:
uv run --no-project python -c "
import yaml
from pathlib import Path
with open('knowledge/seeds/${DOMAIN}-resources.yaml') as f:
    data = yaml.safe_load(f)
for section in ['textbooks', 'standards']:
    for item in data.get(section) or []:
        raw = item.get('local_path')
        # An empty path would resolve to the current directory, so test it first
        status = 'OK' if raw and Path(raw).is_file() else 'MISSING'
        print(f' {status}: {raw}')
"
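The rule that every attempted resource must appear can be checked mechanically. This is a minimal sketch, assuming the catalogue has already been parsed into a dict and uses the `source_url` and `url` keys from the Step 7 template; `missing_from_catalogue` and the list of attempted URLs are hypothetical.

```python
def missing_from_catalogue(attempted_urls, catalogue):
    """Return attempted URLs that are recorded nowhere in the catalogue.

    Downloaded items are matched on 'source_url'; portals and
    pending_manual items are matched on 'url'.
    """
    recorded = set()
    for section in ("textbooks", "standards"):
        for item in catalogue.get(section) or []:
            recorded.add(item.get("source_url"))
    for section in ("online_portals", "pending_manual"):
        for item in catalogue.get(section) or []:
            recorded.add(item.get("url"))
    return [u for u in attempted_urls if u not in recorded]


# Placeholder data for illustration only.
catalogue = {
    "textbooks": [{"title": "T", "source_url": "https://example.org/a.pdf"}],
    "pending_manual": [{"title": "B", "url": "https://example.org/b.pdf"}],
}
attempted = [
    "https://example.org/a.pdf",
    "https://example.org/b.pdf",
    "https://example.org/c.pdf",
]
print(missing_from_catalogue(attempted, catalogue))
```

Any URL printed here was attempted but silently dropped, and should be added to pending_manual with a reason.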
AC Checklist
- Domain literature directory exists with standard subdirectories
- Standards ledger queried — gap standards identified
- Document index searched — existing docs catalogued
- Capability map cross-referenced — implementation gaps noted
- Web search performed for open-access PDFs (Tiers 1-3)
- Download script created, sourcing scripts/lib/download-helpers.sh
- Download script run with --dry-run then executed
- All downloaded PDFs validated with file (no HTML/WAF fakes)
- Catalogue YAML written at knowledge/seeds/<domain>-resources.yaml
- Research brief saved to specs/capability-map/research-briefs/
- Failed/blocked resources recorded in pending_manual: (none silently skipped)
- Document index pipeline triggered (Phase A) or queued
- Worked examples archived in knowledge/dark-intelligence/<domain>/
Quick Command Reference
# Query ledger for a domain
uv run --no-project python scripts/data/document-index/query-ledger.py --domain ${DOMAIN} --verbose
# Run the research driver (generates brief + optional download script)
uv run --no-project python scripts/data/research-literature/research-domain.py \
--category ${DOMAIN} --repo digitalmodel --generate-download-script
# Validate PDFs after download
find /mnt/ace-data/digitalmodel/docs/domains/${DOMAIN}/literature -name "*.pdf" \
-exec file {} \; | grep -v "PDF document"
# Check og_standards for existing copies
sqlite3 "/mnt/ace/O&G-Standards/_inventory.db" \
"SELECT path FROM documents WHERE path LIKE '%keyword%' AND is_duplicate=0 LIMIT 20"
# Re-index after adding literature
uv run --no-project python scripts/data/document-index/phase-a-index.py