Agent skill
spider
Web crawling and scraping with analysis. Use for crawling websites, security scanning, and extracting information from web pages.
Install this agent skill to your Project
npx add-skill https://github.com/johnlindquist/claude/tree/main/skills/spider
SKILL.md
Web Spider
Crawl and analyze websites.
Prerequisites
# curl for fetching
curl --version
# Gemini for analysis
pip install google-generativeai
export GEMINI_API_KEY=your_api_key
# Optional: better HTML parsing
npm install -g puppeteer
pip install beautifulsoup4
Basic Crawling
Fetch Page
# Simple fetch
curl -s "https://example.com"
# With headers
curl -s -H "User-Agent: Mozilla/5.0" "https://example.com"
# Follow redirects
curl -sL "https://example.com"
# Save to file
curl -s "https://example.com" -o page.html
# Get headers only
curl -sI "https://example.com"
# Get headers and body
curl -sD - "https://example.com"
Extract Links
# Extract all links from a page
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sed 's/href="//;s/"$//'
# Filter to specific domain
curl -s "https://example.com" | grep -oE 'href="https?://example\.com[^"]*"'
# Unique links
curl -s "https://example.com" | grep -oE 'href="[^"]*"' | sort -u
Quick Site Scan
#!/bin/bash
URL=$1
echo "=== Scanning: $URL ==="
echo ""
echo "### Response Info ###"
curl -sI "$URL" | head -20
echo ""
echo "### Links Found ###"
curl -s "$URL" | grep -oE 'href="[^"]*"' | sort -u | head -20
echo ""
echo "### Scripts ###"
curl -s "$URL" | grep -oE 'src="[^"]*\.js[^"]*"' | sort -u
echo ""
echo "### Meta Tags ###"
curl -s "$URL" | grep -oE '<meta[^>]*>' | head -10
AI-Powered Analysis
Page Analysis
CONTENT=$(curl -s "https://example.com")
gemini -m pro -o text -e "" "Analyze this webpage:
$CONTENT
Provide:
1. What is this page about?
2. Key information/content
3. Navigation structure
4. Technical observations (framework, libraries)
5. SEO observations"
Security Scan
URL="https://example.com"
# Gather information
HEADERS=$(curl -sI "$URL")
CONTENT=$(curl -s "$URL" | head -1000)
gemini -m pro -o text -e "" "Security scan this website:
URL: $URL
HEADERS:
$HEADERS
CONTENT SAMPLE:
$CONTENT
Check for:
1. Missing security headers (CSP, HSTS, X-Frame-Options)
2. Exposed sensitive information
3. Potential vulnerabilities in scripts/forms
4. Cookie security settings
5. HTTPS configuration"
Extract Structured Data
CONTENT=$(curl -s "https://example.com/products")
gemini -m pro -o text -e "" "Extract structured data from this page:
$CONTENT
Extract into JSON format:
- Product names
- Prices
- Descriptions
- Any available metadata"
Multi-Page Crawling
Crawl and Analyze
#!/bin/bash
BASE_URL=$1
MAX_PAGES=${2:-10}
# Get initial links
LINKS=$(curl -s "$BASE_URL" | grep -oE "href=\"$BASE_URL[^\"]*\"" | sed 's/href="//;s/"$//' | sort -u | head -$MAX_PAGES)
echo "Found $(echo "$LINKS" | wc -l) pages"
for link in $LINKS; do
echo ""
echo "=== $link ==="
TITLE=$(curl -s "$link" | grep -oE '<title>[^<]*</title>' | sed 's/<[^>]*>//g')
echo "Title: $TITLE"
done
Sitemap Processing
# Fetch and parse sitemap
curl -s "https://example.com/sitemap.xml" | grep -oE '<loc>[^<]*</loc>' | sed 's/<[^>]*>//g'
# Crawl pages from sitemap
curl -s "https://example.com/sitemap.xml" | \
grep -oE '<loc>[^<]*</loc>' | \
sed 's/<[^>]*>//g' | \
while read url; do
echo "Processing: $url"
# Your processing here
done
Specific Extractions
Extract Text Content
# Remove HTML tags (basic)
curl -s "https://example.com" | sed 's/<[^>]*>//g' | tr -s ' \n'
# Using python
curl -s "https://example.com" | python3 -c "
from html.parser import HTMLParser
import sys
class TextExtractor(HTMLParser):
def __init__(self):
super().__init__()
self.text = []
def handle_data(self, data):
self.text.append(data.strip())
p = TextExtractor()
p.feed(sys.stdin.read())
print(' '.join(filter(None, p.text)))
"
Extract API Endpoints
CONTENT=$(curl -s "https://example.com/app.js")
# Find API calls in JS
echo "$CONTENT" | grep -oE "(fetch|axios|http)\(['\"][^'\"]*['\"]" | sort -u
# AI extraction
gemini -m pro -o text -e "" "Extract API endpoints from this JavaScript:
$CONTENT
List all:
- API URLs
- HTTP methods used
- Request patterns"
Monitor Changes
#!/bin/bash
URL=$1
HASH_FILE="/tmp/page-hash-$(echo $URL | md5sum | cut -d' ' -f1)"
CURRENT=$(curl -s "$URL" | md5sum | cut -d' ' -f1)
if [ -f "$HASH_FILE" ]; then
PREVIOUS=$(cat "$HASH_FILE")
if [ "$CURRENT" != "$PREVIOUS" ]; then
echo "Page changed: $URL"
# Notify or log
fi
fi
echo "$CURRENT" > "$HASH_FILE"
Best Practices
- Respect robots.txt - Check before crawling
- Rate limit - Don't overwhelm servers
- Set User-Agent - Identify your crawler
- Handle errors - Sites go down, pages 404
- Cache responses - Don't re-fetch unnecessarily
- Be ethical - Only crawl what you're allowed to
- Check ToS - Some sites prohibit scraping
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
testgen
Generate tests using AI and run test suites. Use for generating unit tests, running coverage reports, and mutation testing.
article
Generate technical articles and documentation using AI. Use for writing blog posts, documentation, and technical content.
packx
Bundle code context for AI. ALWAYS use --limit 49k unless user explicitly requests otherwise. Use for creating shareable code bundles and preparing context for LLMs.
long-agent
Manage long-running agent sessions. Use for tracking progress in extended tasks, maintaining context across long sessions, and managing multi-step workflows.
db
Database operations for SQLite, PostgreSQL, and MySQL. Use for queries, schema inspection, migrations, and AI-assisted query generation.
investigate
Debug and investigate code issues using search and AI analysis. Use when stuck on bugs, tracing execution flow, or understanding complex code.
Didn't find tool you were looking for?