Agent skill
website-crawler
Crawl and ingest websites into whorl. Use when scraping a personal site, blog, or extracting web content for the knowledge base.
Install this agent skill to your Project
npx add-skill https://github.com/Uzay-G/whorl/tree/main/skills/website-crawler
SKILL.md
Website Crawler for Whorl
Crawl websites and ingest content into your whorl knowledge base.
Prerequisites
Install trafilatura if not already available:
pip install trafilatura
Single Page
Extract a single page and save to whorl docs:
# Extract content as markdown
trafilatura -u "https://example.com/page" --markdown > ~/.whorl/docs/page-name.md
Or with metadata in frontmatter:
URL="https://example.com/page"
SLUG=$(echo "$URL" | sed 's|https\?://||; s|/|_|g; s|_$||')
OUTPUT=~/.whorl/docs/"$SLUG".md
# Fetch and extract
CONTENT=$(trafilatura -u "$URL" --markdown)
TITLE=$(trafilatura -u "$URL" --json | python3 -c "import sys,json; print(json.load(sys.stdin).get('title','Untitled'))" 2>/dev/null || echo "Untitled")
# Write with frontmatter
cat > "$OUTPUT" << EOF
---
title: "$TITLE"
source_url: $URL
fetched_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)
---
$CONTENT
EOF
echo "Saved to $OUTPUT"
Crawl Entire Site
Crawl up to 30 pages from a site:
trafilatura --crawl "https://example.com" --markdown -o ~/.whorl/docs/site-name/
Or with sitemap:
trafilatura --sitemap "https://example.com/sitemap.xml" --markdown -o ~/.whorl/docs/site-name/
Crawl with Custom Limit
For more control, use Python:
import os
from pathlib import Path
from datetime import datetime, timezone
import trafilatura
from trafilatura.spider import focused_crawler
WHORL_DOCS = Path.home() / ".whorl" / "docs"
site_dir = WHORL_DOCS / "my-site"
site_dir.mkdir(parents=True, exist_ok=True)
for url in focused_crawler("https://example.com", max_seen_urls=50):
downloaded = trafilatura.fetch_url(url)
if not downloaded:
continue
content = trafilatura.extract(downloaded, output_format='markdown')
metadata = trafilatura.extract_metadata(downloaded)
if not content:
continue
# Generate filename from URL
slug = url.split("//")[-1].replace("/", "_").rstrip("_")[:80]
filepath = site_dir / f"{slug}.md"
# Write with frontmatter
title = metadata.title if metadata else "Untitled"
frontmatter = f"""---
title: "{title}"
source_url: {url}
fetched_at: {datetime.now(timezone.utc).isoformat()}
---
"""
filepath.write_text(frontmatter + content)
print(f"+ {filepath.name}")
After Crawling
Run whorl sync to process new documents with ingestion agents:
whorl sync
Or if running locally without auth:
curl -X POST http://localhost:8000/api/sync
Tips
- Rate limiting: trafilatura respects robots.txt and has built-in politeness
- Deduplication: whorl's hash index will detect duplicate content
- Binary files: PDFs and images should be downloaded separately with
curl -O - Large sites: Use
max_seen_urlsto limit scope, or target specific sitemaps
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
git-guardrails-claude-code
Set up Claude Code hooks to block dangerous git commands (push, reset --hard, clean, branch -D, etc.) before they execute. Use when user wants to prevent destructive git operations, add git safety hooks, or block git push/reset in Claude Code.
migrate-to-shoehorn
Migrate test files from `as` type assertions to @total-typescript/shoehorn. Use when user mentions shoehorn, wants to replace `as` in tests, or needs partial test data.
setup-pre-commit
Set up Husky pre-commit hooks with lint-staged (Prettier), type checking, and tests in the current repo. Use when user wants to add pre-commit hooks, set up Husky, configure lint-staged, or add commit-time formatting/typechecking/testing.
obsidian-vault
Search, create, and manage notes in the Obsidian vault with wikilinks and index notes. Use when user wants to find, create, or organize notes in Obsidian.
edit-article
Edit and improve articles by restructuring sections, improving clarity, and tightening prose. Use when user wants to edit, revise, or improve an article draft.
scaffold-exercises
Create exercise directory structures with sections, problems, solutions, and explainers that pass linting. Use when user wants to scaffold exercises, create exercise stubs, or set up a new course section.
Didn't find tool you were looking for?