Agent skill

website-crawler

Crawl and ingest websites into whorl. Use when scraping a personal site, blog, or extracting web content for the knowledge base.

View SKILL.md on GitHub Repository

Stars 1

Forks 0

Install this agent skill to your Project

npx add-skill https://github.com/Uzay-G/whorl/tree/main/skills/website-crawler

SKILL.md

Website Crawler for Whorl

Crawl websites and ingest content into your whorl knowledge base.

Prerequisites

Install trafilatura if not already available:

bash

pip install trafilatura

Single Page

Extract a single page and save to whorl docs:

bash

# Extract content as markdown
trafilatura -u "https://example.com/page" --markdown > ~/.whorl/docs/page-name.md

Or with metadata in frontmatter:

bash

URL="https://example.com/page"
SLUG=$(echo "$URL" | sed 's|https\?://||; s|/|_|g; s|_$||')
OUTPUT=~/.whorl/docs/"$SLUG".md

# Fetch and extract
CONTENT=$(trafilatura -u "$URL" --markdown)
TITLE=$(trafilatura -u "$URL" --json | python3 -c "import sys,json; print(json.load(sys.stdin).get('title','Untitled'))" 2>/dev/null || echo "Untitled")

# Write with frontmatter
cat > "$OUTPUT" << EOF
---
title: "$TITLE"
source_url: $URL
fetched_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)
---

$CONTENT
EOF

echo "Saved to $OUTPUT"

Crawl Entire Site

Crawl up to 30 pages from a site:

bash

trafilatura --crawl "https://example.com" --markdown -o ~/.whorl/docs/site-name/

Or with sitemap:

bash

trafilatura --sitemap "https://example.com/sitemap.xml" --markdown -o ~/.whorl/docs/site-name/

Crawl with Custom Limit

For more control, use Python:

python

import os
from pathlib import Path
from datetime import datetime, timezone
import trafilatura
from trafilatura.spider import focused_crawler

WHORL_DOCS = Path.home() / ".whorl" / "docs"
site_dir = WHORL_DOCS / "my-site"
site_dir.mkdir(parents=True, exist_ok=True)

for url in focused_crawler("https://example.com", max_seen_urls=50):
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        continue

    content = trafilatura.extract(downloaded, output_format='markdown')
    metadata = trafilatura.extract_metadata(downloaded)

    if not content:
        continue

    # Generate filename from URL
    slug = url.split("//")[-1].replace("/", "_").rstrip("_")[:80]
    filepath = site_dir / f"{slug}.md"

    # Write with frontmatter
    title = metadata.title if metadata else "Untitled"
    frontmatter = f"""---
title: "{title}"
source_url: {url}
fetched_at: {datetime.now(timezone.utc).isoformat()}
---

"""
    filepath.write_text(frontmatter + content)
    print(f"+ {filepath.name}")

After Crawling

Run whorl sync to process new documents with ingestion agents:

bash

whorl sync

Or if running locally without auth:

bash

curl -X POST http://localhost:8000/api/sync

Tips

Rate limiting: trafilatura respects robots.txt and has built-in politeness
Deduplication: whorl's hash index will detect duplicate content
Binary files: PDFs and images should be downloaded separately with curl -O
Large sites: Use max_seen_urls to limit scope, or target specific sitemaps

Maintainer

Uzay-G Core maintainer

Source details

Full Name: Uzay-G/whorl
Branch: main
Path in repo: skills/website-crawler

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

mattpocock/skills

git-guardrails-claude-code

Set up Claude Code hooks to block dangerous git commands (push, reset --hard, clean, branch -D, etc.) before they execute. Use when user wants to prevent destructive git operations, add git safety hooks, or block git push/reset in Claude Code.

111,310 9,758

Explore

mattpocock/skills

migrate-to-shoehorn

Migrate test files from `as` type assertions to @total-typescript/shoehorn. Use when user mentions shoehorn, wants to replace `as` in tests, or needs partial test data.

111,310 9,758

Explore

mattpocock/skills

setup-pre-commit

Set up Husky pre-commit hooks with lint-staged (Prettier), type checking, and tests in the current repo. Use when user wants to add pre-commit hooks, set up Husky, configure lint-staged, or add commit-time formatting/typechecking/testing.

111,310 9,758

Explore

mattpocock/skills

obsidian-vault

Search, create, and manage notes in the Obsidian vault with wikilinks and index notes. Use when user wants to find, create, or organize notes in Obsidian.

111,310 9,758

Explore

mattpocock/skills

edit-article

Edit and improve articles by restructuring sections, improving clarity, and tightening prose. Use when user wants to edit, revise, or improve an article draft.

111,310 9,758

Explore

mattpocock/skills

scaffold-exercises

Create exercise directory structures with sections, problems, solutions, and explainers that pass linting. Use when user wants to scaffold exercises, create exercise stubs, or set up a new course section.

111,310 9,758

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Website Crawler for Whorl

Prerequisites

Single Page

Crawl Entire Site

Crawl with Custom Limit

After Crawling

Tips

Recommended Agent Skills

git-guardrails-claude-code

migrate-to-shoehorn

setup-pre-commit

obsidian-vault

edit-article

scaffold-exercises