Agent skill

authenticated-web-scraper

Scrape authenticated websites from WSL2 using Edge CDP. Launches headed Edge for user auth, then headless scraping via Chrome DevTools Protocol. Use when mirroring internal wikis, docs sites, or any site requiring 2FA/SSO login.

Install this agent skill to your Project

npx add-skill https://github.com/rysweet/amplihack/tree/main/.claude/skills/authenticated-web-scraper

SKILL.md

Authenticated Web Scraper

Purpose

Scrapes content from websites that require authentication (2FA, SSO, corporate login) by leveraging the user's Windows Edge browser via Chrome DevTools Protocol (CDP). Designed for WSL2 environments where Playwright/Puppeteer can't directly reach Windows browser ports.

When to Use

Mirroring internal documentation sites behind corporate auth
Scraping content from sites whose 2FA/SSO login can't be automated
Extracting structured content (text, HTML, links) from authenticated web pages
Crawling site navigation trees and following links to a configurable depth

Architecture

WSL2                          Windows
┌─────────────────┐           ┌──────────────────────┐
│ Claude Code     │           │ Edge Browser          │
│                 │  kill     │ (user's profile)      │
│ 1. Kill Edge ───┼──────────>│                       │
│                 │  launch   │                       │
│ 2. Launch Edge ─┼──────────>│ --remote-debug:9222   │
│                 │           │ --debug-addr:0.0.0.0  │
│ [User auths     │           │                       │
│  in browser]    │           │ CDP WebSocket on :9222│
│                 │  cmd.exe  │                       │
│ 3. Run scraper ─┼──────────>│ node scraper.mjs      │
│                 │           │ connects localhost:9222│
│ 4. Read output <┼───────────│ writes to C:\Temp\... │
└─────────────────┘           └──────────────────────┘

Key insight: WSL2 cannot reach Windows localhost:9222 directly. The scraper script must run on the Windows side via cmd.exe /c "node script.mjs".

Quick Start

When a user asks to scrape an authenticated website:

Kill existing Edge processes and relaunch with debug flags
User authenticates in the headed browser
Copy scraper script to Windows temp and run via cmd.exe
Script connects to CDP, navigates pages, extracts content
Read results from shared filesystem (/mnt/c/Temp/...)

Core Workflow

Phase 0: Prerequisites

Node.js must be installed on Windows (cmd.exe /c "where node")
The ws npm package must be installed on the Windows side (cmd.exe /c "cd C:\Temp && npm install ws")
Edge browser installed (check /mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe)

Phase 1: Launch Edge with Remote Debugging

javascript

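import { execSync, spawn } from "child_process";
import { setTimeout as sleep } from "timers/promises";

// Assumption: the URL to open for login is passed as the first CLI argument
const targetUrl = process.argv[2];

// CRITICAL: Kill ALL Edge processes first, otherwise debug flags are ignored
try {
  execSync('cmd.exe /c "taskkill /F /IM msedge.exe /T"', { stdio: "ignore" });
} catch {
  // taskkill exits non-zero when no Edge process is running; nothing to kill
}
await sleep(3000); // give Edge time to shut down fully

const EDGE = "/mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe";
spawn(
  EDGE,
  [
    "--remote-debugging-port=9222",
    "--remote-debugging-address=0.0.0.0",
    "--remote-allow-origins=*",
    targetUrl,
  ],
  // Detach so Edge keeps running after this script exits
  { detached: true, stdio: "ignore" }
).unref();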

Phase 2: Verify CDP and User Auth

bash

# Verify CDP is running (must query from Windows side)
powershell.exe -Command "Invoke-RestMethod -Uri http://localhost:9222/json/version"

Tell the user to authenticate in the Edge window, then have them confirm they can see the protected content.
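The same readiness check can also be done from the Windows-side Node.js script before scraping starts. The following is a minimal sketch, not part of the original skill: `waitForCdp` and its options are hypothetical names, and it assumes Node.js 18 or later so that a global `fetch` is available. It polls the `/json/version` endpoint until Edge answers or the attempts run out.

```javascript
// Hypothetical helper: poll the CDP HTTP endpoint until Edge is listening.
// fetchImpl is injectable so the function can be exercised without a browser.
async function waitForCdp(
  port = 9222,
  { attempts = 10, delayMs = 1000, fetchImpl = globalThis.fetch } = {}
) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetchImpl(`http://localhost:${port}/json/version`);
      // The response body includes fields such as "Browser" and
      // "webSocketDebuggerUrl".
      if (res.ok) return await res.json();
    } catch {
      // Connection refused: Edge has not opened the debug port yet.
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`CDP not reachable on port ${port} after ${attempts} attempts`);
}
```

A caller would typically run `const info = await waitForCdp();` and log `info.Browser` before prompting the user to log in.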

Phase 3: Scrape via CDP

Write a Node.js script that:

Queries http://localhost:9222/json/list for open pages
Connects to the target page via WebSocket (ws package)
Uses Runtime.evaluate to extract DOM content
Uses Page.enable (to receive page load events) and Page.navigate for crawling
Saves .txt (clean text), .html (full), _links.json per page
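Two of the steps above are plain data handling and can be sketched without a live browser: choosing the page target from the `/json/list` response, and deriving the three per-page output file names. This is an illustrative sketch only. `pickTarget` and `pageFileNames` are hypothetical helpers, and the slug scheme for file names is an assumption, not the naming the original scraper uses.

```javascript
// Pick the first tab whose URL starts with the site being scraped.
// Entries in /json/list carry "type", "url" and "webSocketDebuggerUrl".
function pickTarget(targets, urlPrefix) {
  const target = targets.find(
    (t) => t.type === "page" && t.url.startsWith(urlPrefix)
  );
  if (!target) throw new Error(`No open page matches ${urlPrefix}`);
  return target;
}

// Build Windows output paths for one page (assumed naming scheme).
function pageFileNames(pageUrl, outDir) {
  const { host, pathname } = new URL(pageUrl);
  // Collapse every run of non-alphanumeric characters into "_".
  const slug = (host + pathname)
    .replace(/[^a-zA-Z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
  return {
    text: `${outDir}\\${slug}.txt`,
    html: `${outDir}\\${slug}.html`,
    links: `${outDir}\\${slug}_links.json`,
  };
}
```

The `webSocketDebuggerUrl` of the chosen target is what the script would then pass to `new WebSocket(...)` from the ws package.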

Run on Windows side:

bash

cp script.mjs /mnt/c/Temp/scraper.mjs
cmd.exe /c "cd C:\Temp && node scraper.mjs C:\Temp\output" 2>&1

Phase 4: Crawl Navigation

Extract sidebar/nav links from the initial page
Filter to same-domain pages (skip anchor links)
Visit each nav page, extract content + links
Follow discovered links one level deep (deduplicating)
Write summary JSON with page inventory
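The crawl steps above amount to a breadth-first traversal with same-origin filtering and deduplication. Here is a minimal sketch under stated assumptions: `normalizeLink` and `crawl` are hypothetical names, and `visit` is a caller-supplied async function that loads a page (for example via CDP), saves its content, and returns the hrefs found on it. Depth 0 is the initial page, depth 1 the nav pages, and depth 2 the links discovered on those, so the default `maxDepth` of 2 corresponds to following discovered links one level deep.

```javascript
// Resolve href against the page URL, drop the #fragment, and reject
// anything that is malformed or on a different origin (including mailto:).
function normalizeLink(href, pageUrl) {
  let url;
  try {
    url = new URL(href, pageUrl);
  } catch {
    return null;
  }
  if (url.origin !== new URL(pageUrl).origin) return null;
  url.hash = "";
  return url.href;
}

// Breadth-first crawl. Returns the page inventory for the summary JSON.
async function crawl(startUrl, visit, { maxDepth = 2, delayMs = 2000 } = {}) {
  const seen = new Set([normalizeLink(startUrl, startUrl)]);
  const inventory = [];
  let frontier = [...seen];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      const links = await visit(url);
      inventory.push({ url, depth, linkCount: links.length });
      for (const href of links) {
        const target = normalizeLink(href, url);
        if (target && !seen.has(target)) {
          seen.add(target); // dedupe before queueing
          next.push(target);
        }
      }
      // Pause between page loads to keep the crawl polite.
      if (delayMs > 0) await new Promise((r) => setTimeout(r, delayMs));
    }
    frontier = next;
  }
  return inventory;
}
```

The returned inventory can be serialized with `JSON.stringify(inventory, null, 2)` to produce the summary file.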

CDP Command Reference

javascript

// Navigate to a page
await cdpSend(ws, "Page.navigate", { url });

// Extract text content
await cdpSend(ws, "Runtime.evaluate", {
  // Fall back to <body> when the page has no <main> element
  expression: '(document.querySelector("main") || document.body).innerText',
  returnByValue: true,
});

// Extract links as JSON
await cdpSend(ws, "Runtime.evaluate", {
  expression:
    'JSON.stringify([...document.querySelectorAll("a[href]")].map(a => ({href: a.href, text: a.textContent.trim()})))',
  returnByValue: true,
});

// Get full HTML
await cdpSend(ws, "Runtime.evaluate", {
  expression: "document.documentElement.outerHTML",
  returnByValue: true,
});
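The snippets above call a `cdpSend` helper that this document does not define. One possible shape for it is sketched below, assuming a socket with the `on`/`off`/`send` interface of the ws package. CDP replies echo the `id` of the request, so the helper matches on that and ignores events and other replies. The `evaluate` wrapper is a hypothetical convenience that unwraps the value, which `Runtime.evaluate` returns under `result.value`.

```javascript
let nextId = 0;

// Send one CDP command and resolve with its "result" payload.
function cdpSend(ws, method, params = {}) {
  const id = ++nextId;
  return new Promise((resolve, reject) => {
    const onMessage = (data) => {
      const msg = JSON.parse(data.toString());
      if (msg.id !== id) return; // an event, or a reply to another command
      ws.off("message", onMessage);
      if (msg.error) reject(new Error(`${method}: ${msg.error.message}`));
      else resolve(msg.result);
    };
    ws.on("message", onMessage);
    ws.send(JSON.stringify({ id, method, params }));
  });
}

// Evaluate an expression in the page and return its value.
async function evaluate(ws, expression) {
  const reply = await cdpSend(ws, "Runtime.evaluate", {
    expression,
    returnByValue: true,
  });
  // exceptionDetails is present when the expression threw in the page.
  if (reply.exceptionDetails) {
    throw new Error(`Page script failed: ${reply.exceptionDetails.text}`);
  }
  return reply.result.value;
}
```

With this in place, `await evaluate(ws, "document.title")` returns the title string directly rather than the full CDP reply.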

Critical Details

Must kill Edge first: If Edge is already running, new instances join the existing process and ignore --remote-debugging-port
WSL2 networking: In its default NAT mode, WSL2 has its own network stack, so 127.0.0.1 in WSL does NOT reach Windows. Scripts must run on Windows via cmd.exe
Respectful crawling: Add 2-second delays between page loads
Auth persistence: Edge uses the user's default profile with saved sessions
Output path: Use Windows paths (C:\Temp\...) in scripts, read via /mnt/c/Temp/... from WSL
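The output path rule means each location has two spellings, one per side. The sketch below converts between them, assuming the default WSL automount root of `/mnt`. Both function names are hypothetical. Inside WSL, the built-in `wslpath` utility does the same job from the shell.

```javascript
// C:\Temp\output -> /mnt/c/Temp/output (assumes the default /mnt automount)
function windowsToWslPath(winPath) {
  const match = /^([A-Za-z]):[\\/](.*)$/.exec(winPath);
  if (!match) throw new Error(`Not an absolute drive path: ${winPath}`);
  const [, drive, rest] = match;
  return `/mnt/${drive.toLowerCase()}/${rest.replace(/\\/g, "/")}`;
}

// /mnt/c/Temp/scraper.mjs -> C:\Temp\scraper.mjs
function wslToWindowsPath(wslPath) {
  const match = /^\/mnt\/([a-z])\/(.*)$/.exec(wslPath);
  if (!match) throw new Error(`Not under /mnt/<drive>/: ${wslPath}`);
  const [, drive, rest] = match;
  return `${drive.toUpperCase()}:\\${rest.replace(/\//g, "\\")}`;
}
```

For example, the orchestrating side can pass `C:\Temp\output` to the scraper and then read results from `windowsToWslPath("C:\\Temp\\output")`.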

Integration Points

Works with any documentation site behind corporate auth (SSO, SAML, FIDO2, etc.)
Output can be fed to other skills for analysis, summarization, or knowledge base building
Pairs well with investigation-workflow and knowledge-builder skills

Maintainer

rysweet Core maintainer

Source details

Full Name: rysweet/amplihack
Branch: main
Path in repo: .claude/skills/authenticated-web-scraper
