Agent skill
authenticated-web-scraper
Scrape authenticated websites from WSL2 using Edge CDP. Launches headed Edge for user auth, then headless scraping via Chrome DevTools Protocol. Use when mirroring internal wikis, docs sites, or any site requiring 2FA/SSO login.
Install this agent skill to your Project
npx add-skill https://github.com/rysweet/amplihack/tree/main/.claude/skills/authenticated-web-scraper
SKILL.md
Authenticated Web Scraper
Purpose
Scrapes content from websites that require authentication (2FA, SSO, corporate login) by leveraging the user's Windows Edge browser via Chrome DevTools Protocol (CDP). Designed for WSL2 environments where Playwright/Puppeteer can't directly reach Windows browser ports.
When to Use
- Mirroring internal documentation sites behind corporate auth
- Scraping content from sites whose 2FA/SSO login can't be automated
- Extracting structured content (text, HTML, links) from authenticated web pages
- Crawling site navigation trees and following links to a configurable depth
Architecture
        WSL2                          Windows
┌──────────────────┐         ┌──────────────────────────┐
│ Claude Code      │         │ Edge Browser             │
│                  │  kill   │ (user's profile)         │
│ 1. Kill Edge ────┼────────>│                          │
│                  │ launch  │                          │
│ 2. Launch Edge ──┼────────>│ --remote-debug:9222      │
│                  │         │ --debug-addr:0.0.0.0     │
│ [User auths      │         │                          │
│  in browser]     │         │ CDP WebSocket on :9222   │
│                  │ cmd.exe │                          │
│ 3. Run scraper ──┼────────>│ node scraper.mjs         │
│                  │         │ connects localhost:9222  │
│ 4. Read output <─┼─────────│ writes to C:\Temp\...    │
└──────────────────┘         └──────────────────────────┘
Key insight: WSL2 cannot reach Windows localhost:9222 directly. The scraper script must run on the Windows side via cmd.exe /c "node script.mjs".
Quick Start
When a user asks to scrape an authenticated website:
- Kill existing Edge processes and relaunch with debug flags
- User authenticates in the headed browser
- Copy scraper script to Windows temp and run via cmd.exe
- Script connects to CDP, navigates pages, extracts content
- Read results from shared filesystem (/mnt/c/Temp/...)
Core Workflow
Phase 0: Prerequisites
- Node.js must be installed on Windows (cmd.exe /c "where node")
- The ws npm package on Windows side (cmd.exe /c "cd C:\Temp && npm install ws")
- Edge browser installed (check /mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe)
Phase 1: Launch Edge with Remote Debugging
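import { execSync, spawn } from "child_process";
import { setTimeout as sleep } from "timers/promises";

// Site to open for login; assumed here to be passed as the first CLI argument
const targetUrl = process.argv[2];

// CRITICAL: Kill ALL Edge processes first, otherwise debug flags are ignored
try {
  execSync('cmd.exe /c "taskkill /F /IM msedge.exe /T"', { stdio: "ignore" });
} catch {
  // taskkill exits non-zero when no Edge process is running; safe to continue
}
await sleep(3000); // give Windows time to tear the processes down

const EDGE = "/mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe";
spawn(
  EDGE,
  [
    "--remote-debugging-port=9222",
    "--remote-debugging-address=0.0.0.0",
    "--remote-allow-origins=*",
    targetUrl,
  ],
  { detached: true, stdio: "ignore" }
).unref();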
Phase 2: Verify CDP and User Auth
# Verify CDP is running (must query from Windows side)
powershell.exe -Command "Invoke-RestMethod -Uri http://localhost:9222/json/version"
Ask the user to authenticate in the Edge window, then have them confirm that the protected content is visible.
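Edge can take a moment to start listening after launch, so a script may want to poll the version endpoint rather than query it once. The following is a minimal sketch, assuming Node.js 18+ (for the global fetch) and that it runs on the Windows side; waitForCdp and its options are hypothetical names introduced for illustration.

```javascript
// Poll the CDP version endpoint until Edge answers or the attempts run out.
// `fetchImpl` and `delayMs` are injectable so the function can be tried without a browser.
async function waitForCdp({
  port = 9222,
  attempts = 10,
  delayMs = 1000,
  fetchImpl = fetch, // global fetch, available in Node.js 18+
} = {}) {
  const url = `http://localhost:${port}/json/version`;
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetchImpl(url);
      if (res.ok) return await res.json(); // browser version info as JSON
    } catch {
      // Connection refused: Edge is not listening yet, so retry after a pause.
    }
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`CDP did not respond on port ${port} after ${attempts} attempts`);
}
```

A caller would await waitForCdp() before prompting the user to log in, and treat a thrown error as a sign that Edge was not fully killed before relaunch.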
Phase 3: Scrape via CDP
Write a Node.js script that:
- Queries http://localhost:9222/json/list for open pages
- Connects to the target page via WebSocket (ws package)
- Uses Runtime.evaluate to extract DOM content
- Uses Page.navigate + Page.enable for crawling
- Saves .txt (clean text), .html (full), _links.json per page
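The first two steps above (list the open pages, then connect to the right one) can be sketched as a small selection function. This is an illustration under assumptions: pickPageTarget is a hypothetical helper, and preferring a tab on the same host as the site being scraped is one reasonable heuristic rather than the skill's documented behavior.

```javascript
// Pick the CDP target for the tab the user authenticated in.
// `targets` is the parsed array returned by http://localhost:9222/json/list.
function pickPageTarget(targets, targetUrl) {
  const wantedHost = new URL(targetUrl).hostname;
  const pages = targets.filter(
    (t) => t.type === "page" && typeof t.webSocketDebuggerUrl === "string"
  );
  // Prefer a tab on the same host as the site being scraped;
  // otherwise fall back to the first open page.
  const match = pages.find((t) => {
    try {
      return new URL(t.url).hostname === wantedHost;
    } catch {
      return false; // skip targets whose URL does not parse
    }
  });
  return match ?? pages[0] ?? null;
}
```

The returned object's webSocketDebuggerUrl is what the script would pass to the WebSocket constructor from the ws package.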
Run on Windows side:
cp script.mjs /mnt/c/Temp/scraper.mjs
cmd.exe /c "cd C:\Temp && node scraper.mjs C:\Temp\output" 2>&1
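Because the scraper writes three files per page into the output directory passed on the command line, it needs a stable way to turn a URL into file names. The sketch below shows one possibility; outputFilesFor and its slug scheme are assumptions made for illustration, not the skill's actual naming.

```javascript
// Derive the three per-page output paths from a page URL.
// The slug scheme is illustrative: it uses only the URL path, so pages that
// differ only by query string would collide.
function outputFilesFor(pageUrl, outDir) {
  const { pathname } = new URL(pageUrl);
  const slug =
    pathname
      .replace(/[^a-zA-Z0-9]+/g, "_") // collapse characters unsafe in Windows file names
      .replace(/^_+|_+$/g, "") || "index"; // the site root becomes "index"
  const base = `${outDir.replace(/[\\/]+$/, "")}\\${slug}`;
  return {
    text: `${base}.txt`, // clean text
    html: `${base}.html`, // full HTML
    links: `${base}_links.json`, // extracted links
  };
}
```

For example, a page at /docs/Getting-Started/ with an output directory of C:\Temp\output would map to C:\Temp\output\docs_Getting_Started.txt and its two siblings.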
Phase 4: Crawl Navigation
- Extract sidebar/nav links from the initial page
- Filter to same-domain pages (skip anchor links)
- Visit each nav page, extract content + links
- Follow discovered links one level deep (deduplicating)
- Write summary JSON with page inventory
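The crawl steps above could be sketched roughly as follows. Several things here are assumptions: filterNavLinks and crawl are hypothetical names, visit(url) stands in for whatever loads a page over CDP and returns its { href, text } links, and "same-domain" is interpreted as same origin (scheme, host, and port).

```javascript
// Keep same-origin links, strip in-page anchors, and deduplicate against `seen`.
function filterNavLinks(links, pageUrl, seen = new Set()) {
  const origin = new URL(pageUrl).origin;
  const result = [];
  for (const { href } of links) {
    let url;
    try {
      url = new URL(href, pageUrl); // resolves relative hrefs
    } catch {
      continue; // unparseable href
    }
    if (url.origin !== origin) continue; // other site, or mailto:/javascript:
    url.hash = ""; // "#section" links point into a page, not to a new page
    if (seen.has(url.href)) continue;
    seen.add(url.href);
    result.push(url.href);
  }
  return result;
}

// Depth 0 is the start page, depth 1 its nav links, depth 2 the links
// discovered on those pages (the "one level deep" follow-up).
async function crawl(startUrl, visit, { maxDepth = 2, delayMs = 2000 } = {}) {
  const start = new URL(startUrl);
  start.hash = "";
  const seen = new Set([start.href]);
  const inventory = [];
  let frontier = [start.href];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      const links = await visit(url);
      inventory.push({ url, depth, linkCount: links.length });
      next.push(...filterNavLinks(links, url, seen));
      await new Promise((resolve) => setTimeout(resolve, delayMs)); // pause between page loads
    }
    frontier = next;
  }
  return inventory; // can be serialized as the summary JSON
}
```

Sharing one seen set across all pages is what keeps a link that appears in every sidebar from being visited more than once.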
CDP Command Reference
// cdpSend(ws, method, params) is assumed to be a helper that sends one CDP
// command over the WebSocket and resolves with its result.

// Navigate to a page
await cdpSend(ws, "Page.navigate", { url });

// Extract text content (falls back to <body> when the page has no <main>)
await cdpSend(ws, "Runtime.evaluate", {
  expression: '(document.querySelector("main") || document.body).innerText',
  returnByValue: true,
});

// Extract links as JSON
await cdpSend(ws, "Runtime.evaluate", {
  expression:
    'JSON.stringify([...document.querySelectorAll("a[href]")].map(a => ({href: a.href, text: a.textContent.trim()})))',
  returnByValue: true,
});

// Get full HTML
await cdpSend(ws, "Runtime.evaluate", {
  expression: "document.documentElement.outerHTML",
  returnByValue: true,
});
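The commands above rely on a cdpSend helper that the skill does not show. One plausible shape for it, given here as a sketch rather than the skill's actual implementation, matches each reply to its request by id. It assumes ws is an open WebSocket from the ws package; the timeoutMs parameter is an addition for illustration.

```javascript
let nextId = 0;

// Send one CDP command and resolve with its result object.
// CDP replies carry the same `id` as the request; events carry no `id`.
function cdpSend(ws, method, params = {}, timeoutMs = 30000) {
  const id = ++nextId;
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      ws.off("message", onMessage);
      reject(new Error(`${method} timed out after ${timeoutMs} ms`));
    }, timeoutMs);

    function onMessage(data) {
      const msg = JSON.parse(data.toString()); // ws delivers a Buffer by default
      if (msg.id !== id) return; // an event, or the reply to another command
      clearTimeout(timer);
      ws.off("message", onMessage);
      if (msg.error) reject(new Error(`${method}: ${msg.error.message}`));
      else resolve(msg.result);
    }

    ws.on("message", onMessage);
    ws.send(JSON.stringify({ id, method, params }));
  });
}
```

With this shape, the value produced by a Runtime.evaluate call that sets returnByValue is found at result.value on the resolved object.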
Critical Details
- Must kill Edge first: If Edge is already running, new instances join the existing process and ignore --remote-debugging-port
- WSL2 networking: WSL2 has its own network stack; 127.0.0.1 in WSL does NOT reach Windows. Scripts must run on Windows via cmd.exe
- Respectful crawling: Add 2-second delays between page loads
- Auth persistence: Edge uses the user's default profile with saved sessions
- Output path: Use Windows paths (C:\Temp\...) in scripts, read via /mnt/c/Temp/... from WSL
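Since the same output directory is addressed by a Windows path inside the scraper and by a mount path from WSL, a small conversion helper keeps the two in sync. This is a minimal sketch: windowsToWslPath is a hypothetical name, and it assumes the default WSL automount root of /mnt.

```javascript
// Convert a Windows drive path (C:\Temp\output) to its WSL mount path (/mnt/c/Temp/output).
// Assumes the default automount root (/mnt); a customized wsl.conf would change it.
function windowsToWslPath(winPath) {
  const match = /^([A-Za-z]):[\\/](.*)$/.exec(winPath);
  if (!match) throw new Error(`Not an absolute drive path: ${winPath}`);
  const [, drive, rest] = match;
  return `/mnt/${drive.toLowerCase()}/${rest.replace(/\\/g, "/")}`;
}
```

On the WSL side, the wslpath utility performs the same translation and also respects a non-default mount root, so it may be preferable when shelling out is acceptable.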
Integration Points
- Works with any documentation site behind corporate auth (SSO, SAML, FIDO2, etc.)
- Output can be fed to other skills for analysis, summarization, or knowledge base building
- Pairs well with investigation-workflow and knowledge-builder skills