Agent skill

fetcher

Fetch web pages, PDFs, and documents with automatic fallbacks and content extraction. Use when user says "fetch this URL", "download this page", "crawl this website", "extract content from", "get the PDF", or provides URLs needing retrieval.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/data/fetcher

Metadata

Additional technical details for this skill

short description: Web crawling and document fetching CLI

SKILL.md

Fetcher - Web Crawling

Fetch web pages and documents with automatic fallbacks, proxy rotation, and content extraction.

Self-contained skill - auto-installs via uvx from git (no pre-installation needed).

Fully automatic - Playwright browsers are installed on first run for SPA/JS page support.

Simplest Usage

bash

# Via wrapper (recommended - auto-installs)
.agents/skills/fetcher/run.sh get https://example.com

# Or directly if fetcher is installed
fetcher get https://example.com

Common Commands

bash

./run.sh get https://example.com                   # Fetch single URL
./run.sh get-manifest urls.txt                     # Fetch list of URLs
./run.sh get-manifest - < urls.txt                 # Fetch from stdin

Common Patterns

Fetch a single URL

bash

fetcher get https://www.nasa.gov --out run/nasa

Outputs to run/nasa/:

consumer_summary.json - structured result
Walkthrough.md - human-readable summary
downloads/ - raw content files

Fetch multiple URLs

bash

# From file (one URL per line)
fetcher get-manifest urls.txt --out run/batch

# From stdin
echo -e "https://example.com\nhttps://nasa.gov" | fetcher get-manifest -
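
When the URL list comes from another script rather than a hand-written file, the manifest can be generated first. The sketch below only assumes the format stated above (one URL per line); the helper name `manifest_text` is hypothetical and not part of fetcher.

```python
from typing import Iterable


def manifest_text(urls: Iterable[str]) -> str:
    """Return manifest content: one URL per line, blanks and duplicates dropped."""
    seen = set()
    lines = []
    for raw in urls:
        url = raw.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    # Terminate every entry with a newline; an empty input yields an empty string.
    return "".join(f"{url}\n" for url in lines)


if __name__ == "__main__":
    # Print the manifest so it can be redirected to a file or piped onward.
    print(manifest_text(["https://example.com", "https://nasa.gov"]), end="")
```

The returned text can be written to urls.txt, or piped to `fetcher get-manifest -` on stdin.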

ETL mode (full control)

bash

fetcher-etl --inventory urls.jsonl --out run/etl_batch
fetcher-etl --manifest urls.txt --out run/demo

Check environment

bash

fetcher doctor                    # Check dependencies and config
fetcher get --dry-run <url>       # Validate without fetching
fetcher-etl --help-full           # All options
fetcher-etl --find metrics        # Search options

Output Structure

run/artifacts/<run-id>/
├── results.jsonl              # Fetch results per URL
├── consumer_summary.json      # Summary stats
├── Walkthrough.md             # Human-readable summary
├── downloads/                 # Raw files (HTML, PDF, etc.)
├── text_blobs/                # Extracted text
├── markdown/                  # LLM-friendly markdown
├── fit_markdown/              # Pruned markdown for LLM input
├── junk_results.jsonl         # Failed/junk URLs
└── junk_table.md              # Quick triage table
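
For a quick look at a finished run, results.jsonl can be tallied by verdict. This is a minimal sketch that assumes each line of results.jsonl is a JSON object carrying the `content_verdict` field listed under FetchResult Fields; the exact on-disk schema is not documented here, so treat the key name as an assumption. The helper name `tally_verdicts` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def tally_verdicts(lines: Iterable[str]) -> Counter:
    """Count JSON Lines records per content_verdict value."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # ASSUMPTION: records expose "content_verdict"; anything else is "unknown".
        counts[record.get("content_verdict", "unknown")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '{"url": "https://example.com", "content_verdict": "ok"}',
        '{"url": "https://example.org", "content_verdict": "paywall"}',
    ]
    print(tally_verdicts(sample))
```

Because the function accepts any iterable of lines, an open file handle on results.jsonl or junk_results.jsonl can be passed directly.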

Content Extraction

Enable markdown output

bash

export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1  # Pruned for LLM input
fetcher get https://example.com

Rolling windows (for chunking)

bash

export FETCHER_DOWNLOAD_MODE=rolling_extract
export FETCHER_ROLLING_WINDOW_SIZE=6000
export FETCHER_ROLLING_WINDOW_STEP=3000
fetcher get https://example.com
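
To see how a window size of 6000 and a step of 3000 interact, the sketch below slices a string into overlapping windows. It illustrates the general sliding-window technique only: it assumes the units are characters and that consecutive windows overlap by size minus step, neither of which is confirmed here for fetcher's own `rolling_extract` mode. The function name `rolling_windows` is hypothetical.

```python
def rolling_windows(text: str, size: int = 6000, step: int = 3000) -> list:
    """Split text into windows of `size` characters, advancing `step` each time."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not text:
        return []
    windows = []
    start = 0
    while True:
        windows.append(text[start:start + size])
        # Stop once the current window reaches the end of the text.
        if start + size >= len(text):
            break
        start += step
    return windows


if __name__ == "__main__":
    chunks = rolling_windows("x" * 12000)
    print(len(chunks), [len(chunk) for chunk in chunks])
```

With a step of half the window size, each position in the interior of the text falls into two windows, which keeps sentences that straddle a boundary intact in at least one chunk.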

Advanced Features

HTTP caching

bash

# Cache enabled by default
fetcher get https://example.com

# Disable cache for fresh fetch
fetcher get https://example.com --no-http-cache

PDF discovery

bash

# Auto-fetch PDF links from HTML pages
export FETCHER_ENABLE_PDF_DISCOVERY=1
export FETCHER_PDF_DISCOVERY_MAX=3
fetcher get https://example.com

Proxy rotation (rate-limited sites)

bash

export SPARTA_STEP06_PROXY_HOST=gw.iproyal.com
export SPARTA_STEP06_PROXY_PORT=12321
export SPARTA_STEP06_PROXY_USER=team
export SPARTA_STEP06_PROXY_PASSWORD=secret
fetcher-etl --inventory urls.jsonl

Brave/Wayback fallbacks

bash

# Enable alternate URL resolution
export BRAVE_API_KEY=sk-your-key
fetcher-etl --use-alternates --inventory urls.jsonl

Python API

python

import asyncio
from pathlib import Path

from fetcher.workflows.web_fetch import URLFetcher, FetchConfig, write_results

async def main():
    # Concurrency settings: overall and per-domain
    config = FetchConfig(concurrency=4, per_domain=2)
    fetcher = URLFetcher(config)
    entries = [{"url": "https://www.nasa.gov"}]
    results, audit = await fetcher.fetch_many(entries)
    out_path = Path("artifacts/nasa.jsonl")
    out_path.parent.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
    write_results(results, out_path)
    print(audit)

asyncio.run(main())

Single URL helper

python

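import asyncio
from fetcher.workflows.fetcher import fetch_url

async def main():
    # fetch_url is a coroutine, so it must be awaited inside an async function
    result = await fetch_url("https://example.com")
    print(result.content_verdict)  # "ok", "empty", "paywall", etc.
    print(result.text)             # Extracted text

asyncio.run(main())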

FetchResult Fields

Field	Description
`url`	Original URL
`final_url`	After redirects
`content_verdict`	`ok`, `empty`, `paywall`, `error`, etc.
`text`	Extracted text content
`file_path`	Path to raw download
`markdown_path`	Path to markdown (if enabled)
`from_cache`	Whether result came from cache
`content_sha256`	Content hash for change detection
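
The `content_sha256` field makes it possible to compare two runs and re-process only pages whose content changed. The sketch below assumes each record is a plain dict with `url` and `content_sha256` keys (for example, parsed from results.jsonl); the real FetchResult may be an object with attributes instead. The helper name `changed_urls` is hypothetical.

```python
from typing import Iterable, Mapping


def changed_urls(previous: Iterable[Mapping], current: Iterable[Mapping]) -> list:
    """Return URLs from `current` that are new or whose content hash differs."""
    old_hashes = {rec["url"]: rec.get("content_sha256") for rec in previous}
    changed = []
    for rec in current:
        url = rec["url"]
        # A URL counts as changed if it was not seen before or its hash moved.
        if url not in old_hashes or old_hashes[url] != rec.get("content_sha256"):
            changed.append(url)
    return changed


if __name__ == "__main__":
    before = [{"url": "https://example.com", "content_sha256": "aaa"}]
    after = [{"url": "https://example.com", "content_sha256": "bbb"}]
    print(changed_urls(before, after))
```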

Environment Variables

Variable	Purpose
`BRAVE_API_KEY`	Enable Brave search fallbacks
`FETCHER_EMIT_MARKDOWN`	Generate LLM-friendly markdown
`FETCHER_EMIT_FIT_MARKDOWN`	Generate pruned markdown
`FETCHER_DOWNLOAD_MODE`	`text`, `download_only`, `rolling_extract`
`FETCHER_HTTP_CACHE_DISABLE`	Disable HTTP caching
`FETCHER_ENABLE_PDF_DISCOVERY`	Auto-fetch embedded PDFs
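
When fetcher is driven from a Python script, the variables above can be set per invocation rather than exported globally. The sketch below only builds the argument list and environment; it uses the `fetcher get <url> --out <dir>` form and the variable names shown in this document. The helper name `build_fetch_invocation` and its parameters are hypothetical.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def build_fetch_invocation(
    url: str,
    out_dir: str,
    *,
    markdown: bool = False,
    fresh: bool = False,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for one `fetcher get` call without running it."""
    # Copy the environment so the caller's process environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    if markdown:
        env["FETCHER_EMIT_MARKDOWN"] = "1"
    if fresh:
        env["FETCHER_HTTP_CACHE_DISABLE"] = "1"
    argv = ["fetcher", "get", url, "--out", out_dir]
    return argv, env


if __name__ == "__main__":
    argv, env = build_fetch_invocation("https://example.com", "run/demo", markdown=True)
    print(argv, env["FETCHER_EMIT_MARKDOWN"])
```

The pair can then be handed to `subprocess.run(argv, env=env, check=True)`, which keeps the settings scoped to that single child process.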

Troubleshooting

Problem	Solution
Playwright missing	`uvx --from "git+https://github.com/grahama1970/fetcher.git" playwright install chromium`
SPA page returns empty/thin	Playwright auto-fallback should trigger; check `used_playwright` in summary
Stale cached results	Set `FETCHER_HTTP_CACHE_DISABLE=1` for fresh fetch
Rate limited	Configure proxy rotation or reduce concurrency
Paywall detected	Check `content_verdict` and use alternates
Empty content	Check `junk_results.jsonl` for diagnosis

Run fetcher doctor to check environment and dependencies.

SPA/JavaScript Page Support

Fetcher automatically falls back to Playwright for known SPA domains. If a page returns thin/empty content:

1. Check whether used_playwright is 1 in consumer_summary.json.
2. If it is not, the domain may need to be added to SPA_FALLBACK_DOMAINS in the fetcher source.
3. Force a fresh fetch with FETCHER_HTTP_CACHE_DISABLE=1.
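
The first check above can be scripted. Because the layout of consumer_summary.json is not documented here, this sketch searches the parsed JSON recursively for any `used_playwright` key instead of assuming where it sits. The helper names `find_key` and `used_playwright` are hypothetical.

```python
import json
from typing import Any, List


def find_key(node: Any, key: str) -> List[Any]:
    """Collect every value stored under `key` anywhere in nested dicts and lists."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            found.extend(find_key(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


def used_playwright(summary: Any) -> bool:
    """True if any used_playwright value in the parsed summary is truthy."""
    return any(bool(value) for value in find_key(summary, "used_playwright"))


if __name__ == "__main__":
    # Inline sample standing in for the parsed contents of consumer_summary.json.
    sample = json.loads('{"stats": {"used_playwright": 1}}')
    print(used_playwright(sample))
```

To apply it to a real run, load the file with `json.loads(Path(...).read_text())` and pass the result to `used_playwright`.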

Maintainer

majiayu000 Core maintainer

Source details

Full Name: majiayu000/claude-skill-registry
Branch: main
Path in repo: skills/data/fetcher
License: MIT License
