Agent skill
web-scrape
Intelligent web scraper with content extraction, multiple output formats, and error handling
Install this agent skill to your Project
npx add-skill https://github.com/aiskillstore/marketplace/tree/main/skills/21pounder/web-scrape
SKILL.md
Web Scraping Skill v3.0
Usage
/web-scrape <url> [options]
Options:
--format=markdown|json|text- Output format (default: markdown)--full- Include full page content (skip smart extraction)--screenshot- Also save a screenshot--scroll- Scroll to load dynamic content (infinite scroll pages)
Examples:
/web-scrape https://example.com/article
/web-scrape https://news.site.com/story --format=json
/web-scrape https://spa-app.com/page --scroll --screenshot
Execution Flow
Phase 1: Navigate and Load
1. mcp__playwright__browser_navigate
url: "<target URL>"
2. mcp__playwright__browser_wait_for
time: 2 (allow initial render)
If --scroll option: Execute scroll sequence to trigger lazy loading:
3. mcp__playwright__browser_evaluate
function: "async () => {
for (let i = 0; i < 3; i++) {
window.scrollTo(0, document.body.scrollHeight);
await new Promise(r => setTimeout(r, 1000));
}
window.scrollTo(0, 0);
}"
Phase 2: Capture Content
4. mcp__playwright__browser_snapshot
→ Returns full accessibility tree with all text content
If --screenshot option:
5. mcp__playwright__browser_take_screenshot
filename: "scraped_<domain>_<timestamp>.png"
fullPage: true
Phase 3: Close Browser
6. mcp__playwright__browser_close
Smart Content Extraction
After getting the snapshot, apply intelligent extraction:
Step 1: Identify Content Type
| Page Type | Indicators | Extraction Strategy |
|---|---|---|
| Article/Blog | <article>, long paragraphs, date/author |
Extract main article body |
| Product Page | Price, "Add to Cart", specs | Extract title, price, description, specs |
| Documentation | Code blocks, headings hierarchy | Preserve structure and code |
| List/Search | Repeated item patterns | Extract as structured list |
| Landing Page | Hero section, CTAs | Extract key messaging |
Step 2: Filter Noise
ALWAYS REMOVE these elements from output:
- Navigation menus and breadcrumbs
- Footer content (copyright, links)
- Sidebars (ads, related articles, social links)
- Cookie banners and popups
- Comments section (unless specifically requested)
- Share buttons and social widgets
- Login/signup prompts
Step 3: Structure the Content
For Articles:
# [Title]
**Source:** [URL]
**Date:** [if available]
**Author:** [if available]
---
[Main content in clean markdown]
For Product Pages:
# [Product Name]
**Price:** [price]
**Availability:** [in stock/out of stock]
## Description
[product description]
## Specifications
| Spec | Value |
|------|-------|
| ... | ... |
Output Formats
Markdown (default)
Clean, readable markdown with proper headings, lists, and formatting.
JSON
{
"url": "https://...",
"title": "Page Title",
"type": "article|product|docs|list",
"content": {
"main": "...",
"metadata": {}
},
"extracted_at": "ISO timestamp"
}
Text
Plain text with minimal formatting, suitable for further processing.
Error Handling
Navigation Errors
| Error | Detection | Action |
|---|---|---|
| Timeout | Page doesn't load in 30s | Report error, suggest retry |
| 404 Not Found | "404" in title/content | Report "Page not found" |
| 403 Forbidden | "403", "Access Denied" | Report access restriction |
| CAPTCHA | "captcha", "verify you're human" | Report CAPTCHA detected, cannot proceed |
| Paywall | "subscribe", "premium content" | Extract visible content, note paywall |
Recovery Actions
If page load fails:
1. Report the specific error to user
2. Suggest: "Try again?" or "Different URL?"
3. Close browser cleanly
If content is blocked:
1. Report what was detected (CAPTCHA/paywall/geo-block)
2. Extract any available preview content
3. Suggest alternatives if applicable
Advanced Scenarios
Single Page Applications (SPA)
1. Navigate to URL
2. Wait longer (3-5 seconds) for JS hydration
3. Use browser_wait_for with specific text if known
4. Then snapshot
Infinite Scroll Pages
1. Navigate
2. Execute scroll loop (see Phase 1)
3. Snapshot after scrolling completes
Pages with Click-to-Reveal Content
1. Snapshot first to identify clickable elements
2. Use browser_click on "Read more" / "Show all" buttons
3. Wait briefly
4. Snapshot again for full content
Multi-page Articles
1. Scrape first page
2. Identify "Next" or pagination links
3. Ask user: "Article has X pages. Scrape all?"
4. If yes, iterate through pages and combine
Performance Guidelines
| Metric | Target | How |
|---|---|---|
| Speed | < 15 seconds | Minimal waits, parallel where possible |
| Token Usage | < 5000 tokens | Smart extraction, not full DOM |
| Reliability | > 95% success | Proper error handling |
Security Notes
- Never execute arbitrary JavaScript from the page
- Don't follow redirects to suspicious domains
- Don't submit forms or click login buttons
- Don't scrape pages that require authentication (unless user provides credentials flow)
- Respect robots.txt when mentioned by user
Quick Reference
Minimum viable scrape (4 tool calls):
1. browser_navigate → 2. browser_wait_for → 3. browser_snapshot → 4. browser_close
Full-featured scrape (with scroll + screenshot):
1. browser_navigate
2. browser_wait_for
3. browser_evaluate (scroll)
4. browser_snapshot
5. browser_take_screenshot
6. browser_close
Remember: The goal is to deliver clean, useful content to the user, not raw HTML/DOM dumps.
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
perigon-backend
Perigon ASP.NET Core + EF Core + Aspire conventions
perigon-agent
Pointers for Copilot/agents to apply Perigon conventions
perigon-angular
Angular 21+ standalone/Material/signal conventions for Perigon WebApp
fastapi-mastery
Comprehensive FastAPI development skill covering REST API creation, routing, request/response handling, validation, authentication, database integration, middleware, and deployment. Use when working with FastAPI projects, building APIs, implementing CRUD operations, setting up authentication/authorization, integrating databases (SQL/NoSQL), adding middleware, handling WebSockets, or deploying FastAPI applications. Triggered by requests involving .py files with FastAPI code, API endpoint creation, Pydantic models, or FastAPI-specific features.
context7-efficient
Token-efficient library documentation fetcher using Context7 MCP with 86.8% token savings through intelligent shell pipeline filtering. Fetches code examples, API references, and best practices for JavaScript, Python, Go, Rust, and other libraries. Use when users ask about library documentation, need code examples, want API usage patterns, are learning a new framework, need syntax reference, or troubleshooting with library-specific information. Triggers include questions like "Show me React hooks", "How do I use Prisma", "What's the Next.js routing syntax", or any request for library/framework documentation.
browser-use
Browser automation using Playwright MCP. Navigate websites, fill forms, click elements, take screenshots, and extract data. Use when tasks require web browsing, form submission, web scraping, UI testing, or any browser interaction.
Didn't find tool you were looking for?