Fetcher MCP

Intelligent web content fetching and extraction using Playwright.

Fetcher MCP is a server that fetches and extracts web page content using the Playwright headless browser while supporting the Model Context Protocol. It intelligently processes dynamic web pages with JavaScript, employs the Readability algorithm to extract main content, and supports output in both HTML and Markdown formats. Designed for seamless integration with AI model environments, it offers robust parallel processing, resource optimization, and flexible deployment options including Docker.

Key Features

Headless browser-based web content fetching using Playwright
JavaScript execution for handling dynamic sites
Automatic main content extraction with Readability algorithm
Supports HTML and Markdown output formats
Concurrent processing of multiple URLs
Resource blocking for performance optimization
Comprehensive error handling and logging
Configurable parameters for timeouts and extraction behavior
HTTP and SSE protocol endpoints compatible with Model Context Protocol
Easy deployment via npx or Docker

Use Cases

Fetching clean article content for AI model context ingestion
Integrating dynamic web scraping into AI chat tools and context providers
Batch processing and extraction of online articles or documentation
Reducing bandwidth and overhead in web data pipelines
Supplying structured web context for summarization or question answering systems
Extracting main content from ad-heavy and cluttered pages
Automating data curation from web resources for knowledge management
Supporting research workflows requiring up-to-date web data ingestion
Establishing browser-based content fetchers in containerized environments
Providing backend for browser automation and information extraction tools

README

MCP server for fetching web page content using the Playwright headless browser.

Advantages

  • JavaScript Support: Unlike traditional web scrapers, Fetcher MCP uses Playwright to execute JavaScript, making it capable of handling dynamic web content and modern web applications.

  • Intelligent Content Extraction: Built-in Readability algorithm automatically extracts the main content from web pages, removing ads, navigation, and other non-essential elements.

  • Flexible Output Format: Supports both HTML and Markdown output formats, making it easy to integrate with various downstream applications.

  • Parallel Processing: The fetch_urls tool enables concurrent fetching of multiple URLs, significantly improving efficiency for batch operations.

  • Resource Optimization: Automatically blocks unnecessary resources (images, stylesheets, fonts, media) to reduce bandwidth usage and improve performance.

  • Robust Error Handling: Comprehensive error handling and logging ensure reliable operation even when dealing with problematic web pages.

  • Configurable Parameters: Fine-grained control over timeouts, content extraction, and output formatting to suit different use cases.
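
The parallel-processing and error-handling points above can be illustrated with a minimal sketch. It is not taken from the Fetcher MCP source: `fetchAll`, `combineResults`, and the `fetchOne` callback (a stand-in for whatever loads one page in a browser tab) are hypothetical names, and the separator format is an assumption. The idea shown is that each URL is fetched concurrently and handles its own failure, so one problematic page does not abort the batch.

```typescript
// Hypothetical types and names; the real server's internals may differ.
type FetchOutcome =
  | { url: string; ok: true; content: string }
  | { url: string; ok: false; error: string };

// `fetchOne` stands in for whatever loads a single page (e.g. one browser tab).
async function fetchAll(
  urls: string[],
  fetchOne: (url: string) => Promise<string>
): Promise<FetchOutcome[]> {
  // Start every fetch immediately; each task catches its own error
  // so a single bad URL cannot reject the whole batch.
  const tasks = urls.map(async (url): Promise<FetchOutcome> => {
    try {
      return { url, ok: true, content: await fetchOne(url) };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      return { url, ok: false, error };
    }
  });
  return Promise.all(tasks);
}

// Join per-page results in input order with a visible separator (assumed format).
function combineResults(outcomes: FetchOutcome[]): string {
  return outcomes
    .map((o) => `[${o.url}]\n${o.ok ? o.content : "Error: " + o.error}`)
    .join("\n\n---\n\n");
}
```

Because failures are captured as values rather than thrown, the caller always receives one outcome per input URL, in the original order.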

Quick Start

Run directly with npx:

bash
npx -y fetcher-mcp

First-time setup: install the required browser by running the following command in your terminal:

bash
npx playwright install chromium

HTTP and SSE Transport

Use the --transport=http parameter to start the Streamable HTTP endpoint and the SSE endpoint at the same time:

bash
npx -y fetcher-mcp --log --transport=http --host=0.0.0.0 --port=3000

After startup, the server provides the following endpoints:

  • /mcp - Streamable HTTP endpoint (modern MCP protocol)
  • /sse - SSE endpoint (legacy MCP protocol)

Clients can connect through either endpoint, depending on which transport they support.
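
As a rough illustration of what a client sends to the /mcp endpoint, the sketch below builds (but does not send) a JSON-RPC 2.0 `tools/call` request for the `fetch_url` tool. The helper name `buildToolCall`, the `HttpRequestSketch` type, and the base URL `http://localhost:3000` (taken from the example command above) are assumptions for illustration. A real client would normally use an MCP SDK and complete the protocol's `initialize` handshake before calling tools.

```typescript
// Hypothetical shape describing an HTTP request; nothing is sent here.
interface HttpRequestSketch {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build a JSON-RPC 2.0 "tools/call" request aimed at the /mcp endpoint.
function buildToolCall(
  baseUrl: string,
  id: number,
  toolName: string,
  args: Record<string, unknown>
): HttpRequestSketch {
  return {
    // Strip trailing slashes so the path is always exactly "/mcp".
    url: baseUrl.replace(/\/+$/, "") + "/mcp",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Streamable HTTP servers may answer with JSON or an event stream.
      Accept: "application/json, text/event-stream",
    },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id,
      method: "tools/call",
      params: { name: toolName, arguments: args },
    }),
  };
}

// Assumed host and port, matching the example startup command.
const exampleRequest = buildToolCall("http://localhost:3000", 1, "fetch_url", {
  url: "https://example.com",
});
```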

Debug Mode

Run with the --debug option to show the browser window for debugging:

bash
npx -y fetcher-mcp --debug

MCP Configuration

Configure this MCP server in Claude Desktop:

On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

On Windows: %APPDATA%/Claude/claude_desktop_config.json

json
{
  "mcpServers": {
    "fetcher": {
      "command": "npx",
      "args": ["-y", "fetcher-mcp"]
    }
  }
}
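
If the configuration file already lists other servers, the fetcher entry has to be merged in rather than pasted over the file. The sketch below shows one way to do that on a parsed config object; `withFetcher` and the two interfaces are hypothetical helpers written for this example, and reading or writing the file itself is left out.

```typescript
interface McpServerEntry {
  command: string;
  args: string[];
}

// Only mcpServers is modelled; any other keys are passed through untouched.
interface ClaudeDesktopConfig {
  mcpServers?: Record<string, McpServerEntry>;
  [key: string]: unknown;
}

// Return a copy of the config with a "fetcher" entry added or replaced.
function withFetcher(
  config: ClaudeDesktopConfig,
  extraArgs: string[] = []
): ClaudeDesktopConfig {
  return {
    ...config,
    mcpServers: {
      ...(config.mcpServers ?? {}),
      fetcher: { command: "npx", args: ["-y", "fetcher-mcp", ...extraArgs] },
    },
  };
}
```

For example, `withFetcher(existingConfig, ["--debug"])` keeps every server already present and starts Fetcher MCP with the visible browser window.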

Docker Deployment

Running with Docker

bash
docker run -p 3000:3000 ghcr.io/jae-jae/fetcher-mcp:latest

Deploying with Docker Compose

Create a docker-compose.yml file:

yaml
version: "3.8"

services:
  fetcher-mcp:
    image: ghcr.io/jae-jae/fetcher-mcp:latest
    container_name: fetcher-mcp
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
    # Using host network mode on Linux hosts can improve browser access efficiency
    # network_mode: "host"
    volumes:
      # For Playwright, may need to share certain system paths
      - /tmp:/tmp
    # Health check
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000"]
      interval: 30s
      timeout: 10s
      retries: 3

Then run:

bash
docker-compose up -d

Features

  • fetch_url - Retrieve web page content from a specified URL

    • Uses the Playwright headless browser to execute JavaScript
    • Supports intelligent extraction of main content and conversion to Markdown
    • Supports the following parameters:
      • url: The URL of the web page to fetch (required parameter)
      • timeout: Page loading timeout in milliseconds, default is 30000 (30 seconds)
      • waitUntil: Specifies when navigation is considered complete, options: 'load', 'domcontentloaded', 'networkidle', 'commit', default is 'load'
      • extractContent: Whether to intelligently extract the main content, default is true
      • maxLength: Maximum length of returned content (in characters), default is no limit
      • returnHtml: Whether to return HTML content instead of Markdown, default is false
      • waitForNavigation: Whether to wait for additional navigation after initial page load (useful for sites with anti-bot verification), default is false
      • navigationTimeout: Maximum time to wait for additional navigation in milliseconds, default is 10000 (10 seconds)
      • disableMedia: Whether to disable media resources (images, stylesheets, fonts, media), default is true
      • debug: Whether to enable debug mode (showing browser window), overrides the --debug command line flag if specified
  • fetch_urls - Batch retrieve web page content from multiple URLs in parallel

    • Uses multi-tab parallel fetching for improved performance
    • Returns combined results with clear separation between webpages
    • Supports the following parameters:
      • urls: Array of URLs to fetch (required parameter)
      • Other parameters are the same as fetch_url
  • browser_install - Install Playwright Chromium browser binary automatically

    • Installs required Chromium browser binary when not available
    • Automatically suggested when browser installation errors occur
    • Supports the following parameters:
      • withDeps: Install system dependencies required by Chromium browser, default is false
      • force: Force installation even if Chromium is already installed, default is false
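
The parameter lists above can be summarized as TypeScript types. This is a sketch written from the documented names and defaults, not the server's actual schema; `FetchUrlArgs`, `FetchUrlsArgs`, and `withDefaults` are hypothetical names introduced here.

```typescript
type WaitUntil = "load" | "domcontentloaded" | "networkidle" | "commit";

// Arguments for fetch_url as documented above; only `url` is required.
interface FetchUrlArgs {
  url: string;
  timeout?: number;            // milliseconds
  waitUntil?: WaitUntil;
  extractContent?: boolean;
  maxLength?: number;          // characters; unlimited when omitted
  returnHtml?: boolean;
  waitForNavigation?: boolean;
  navigationTimeout?: number;  // milliseconds
  disableMedia?: boolean;
  debug?: boolean;             // when set, takes precedence over --debug
}

// fetch_urls takes the same options with `urls` in place of `url`.
type FetchUrlsArgs = Omit<FetchUrlArgs, "url"> & { urls: string[] };

// Fill in the documented defaults; maxLength and debug have no default value.
function withDefaults(args: FetchUrlArgs) {
  return {
    url: args.url,
    timeout: args.timeout ?? 30000,
    waitUntil: args.waitUntil ?? ("load" as WaitUntil),
    extractContent: args.extractContent ?? true,
    maxLength: args.maxLength,
    returnHtml: args.returnHtml ?? false,
    waitForNavigation: args.waitForNavigation ?? false,
    navigationTimeout: args.navigationTimeout ?? 10000,
    disableMedia: args.disableMedia ?? true,
    debug: args.debug,
  };
}
```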

Tips

Handling Special Website Scenarios

Dealing with Anti-Crawler Mechanisms

  • Wait for Complete Loading: For websites using CAPTCHA, redirects, or other verification mechanisms, include in your prompt:

    Please wait for the page to fully load
    

    This will use the waitForNavigation: true parameter.

  • Increase Timeout Duration: For websites that load slowly:

    Please set the page loading timeout to 60 seconds
    

    This adjusts both timeout and navigationTimeout parameters accordingly.

Content Retrieval Adjustments

  • Preserve Original HTML Structure: When content extraction might fail:

    Please preserve the original HTML content
    

    Sets extractContent: false and returnHtml: true.

  • Fetch Complete Page Content: When extracted content is too limited:

    Please fetch the complete webpage content instead of just the main content
    

    Sets extractContent: false.

  • Return Content as HTML: When HTML format is needed instead of default Markdown:

    Please return the content in HTML format
    

    Sets returnHtml: true.
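
The prompts in the tips above map to tool arguments roughly as follows. The constant and function names are hypothetical, and the actual arguments are chosen by the AI client when it interprets the prompt. In particular, applying the same value to both timeouts in `slowSiteTimeouts` is an assumption, since the text only says both are adjusted.

```typescript
// Subset of the fetch_url parameters touched by the tips above.
interface FetchOverrides {
  timeout?: number;            // milliseconds
  navigationTimeout?: number;  // milliseconds
  waitForNavigation?: boolean;
  extractContent?: boolean;
  returnHtml?: boolean;
}

// "Please wait for the page to fully load"
const waitForFullLoad: FetchOverrides = { waitForNavigation: true };

// "Please preserve the original HTML content"
const preserveOriginalHtml: FetchOverrides = { extractContent: false, returnHtml: true };

// "Please fetch the complete webpage content instead of just the main content"
const completePage: FetchOverrides = { extractContent: false };

// "Please return the content in HTML format"
const asHtml: FetchOverrides = { returnHtml: true };

// "Please set the page loading timeout to 60 seconds"
// Both parameters are in milliseconds, so the value in seconds is converted.
function slowSiteTimeouts(seconds: number): FetchOverrides {
  const ms = Math.round(seconds * 1000);
  return { timeout: ms, navigationTimeout: ms };
}
```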

Debugging and Authentication

Enabling Debug Mode

  • Dynamic Debug Activation: To display the browser window during a specific fetch operation:

    Please enable debug mode for this fetch operation
    

    This sets debug: true even if the server was started without the --debug flag.

Using Custom Cookies for Authentication

  • Manual Login: To log in using your own credentials:

    Please run in debug mode so I can manually log in to the website
    

    Sets debug: true or uses the --debug flag, keeping the browser window open for manual login.

  • Interacting with Debug Browser: When debug mode is enabled:

    1. The browser window remains open
    2. You can manually log into the website using your credentials
    3. After login is complete, content will be fetched with your authenticated session
  • Enable Debug for Specific Requests: Even if the server is already running, you can enable debug mode for a specific request:

    Please enable debug mode for this authentication step
    

    Sets debug: true for this specific request only, opening the browser window for manual login.
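
The relationship between the per-request debug parameter and the --debug flag can be stated in a few lines. This is a sketch of the documented rule (the request value wins whenever it is specified), not the server's code; `resolveDebug` is a hypothetical name.

```typescript
// cliDebug: whether the server was started with --debug.
// requestDebug: the optional `debug` argument of a single tool call.
function resolveDebug(cliDebug: boolean, requestDebug?: boolean): boolean {
  // An explicit request value (true or false) overrides the CLI flag;
  // otherwise the CLI setting applies.
  return requestDebug ?? cliDebug;
}
```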

Development

Install Dependencies

bash
npm install

Install Playwright Browser

Install the browsers needed for Playwright:

bash
npm run install-browser

Build the Server

bash
npm run build

Debugging

Use MCP Inspector for debugging:

bash
npm run inspector

You can also enable visible browser mode for debugging:

bash
node build/index.js --debug

Related Projects

  • g-search-mcp: A powerful MCP server for Google search that enables parallel searching with multiple keywords simultaneously. Perfect for batch search operations and data collection.

License

Licensed under the MIT License

Repository Owner

jae-jae

Repository Details

Language: TypeScript
Default Branch: main
Size: 124 KB
Contributors: 2
License: MIT License
MCP Verified: Nov 12, 2025

Programming Languages

TypeScript: 95.1%
JavaScript: 3.14%
Dockerfile: 1.76%

Tags

Topics

ai mcp playwright

Related MCPs

Discover similar Model Context Protocol servers

  • mcp-read-website-fast

    Fast, token-efficient web content extraction and Markdown conversion for AI agents.

    Provides a Model Context Protocol (MCP) compatible server that rapidly fetches web pages, removes noise, and converts content to clean Markdown with link preservation. Designed for local use by AI-powered tools like IDEs and large language models, it offers optimized token usage, concurrency, polite crawling, and smart caching. Integrates with Claude Code, VS Code, JetBrains IDEs, Cursor, and other MCP clients.

    • 111
    • MCP
    • just-every/mcp-read-website-fast
  • WebScraping.AI MCP Server

    MCP server for advanced web scraping and AI-driven data extraction

    WebScraping.AI MCP Server implements the Model Context Protocol to provide web data extraction and question answering functionalities. It integrates with WebScraping.AI to offer robust tools for retrieving, rendering, and parsing web content, including structured data and natural language answers from web pages. It supports JavaScript rendering, proxy management, device emulation, and custom extraction configurations, making it suitable for both individual and team deployments in AI-assisted workflows.

    • 33
    • MCP
    • webscraping-ai/webscraping-ai-mcp-server
  • Scrapeless MCP Server

    A real-time web integration layer for LLMs and AI agents built on the open MCP standard.

    Scrapeless MCP Server is a powerful integration layer enabling large language models, AI agents, and applications to interact with the web in real time. Built on the open Model Context Protocol, it facilitates seamless connections between models like ChatGPT, Claude, and tools such as Cursor to external web capabilities, including Google services, browser automation, and advanced data extraction. The system supports multiple transport modes and is designed to provide dynamic, real-world context to AI workflows. Robust scraping, dynamic content handling, and flexible export formats are core parts of the feature set.

    • 57
    • MCP
    • scrapeless-ai/scrapeless-mcp-server
  • Web Analyzer MCP

    Intelligent web content analysis and summarization via MCP.

    Web Analyzer MCP is an MCP-compliant server designed for intelligent web content analysis and summarization. It leverages FastMCP to perform advanced web scraping, content extraction, and AI-powered question-answering using OpenAI models. The tool integrates with various developer IDEs, offering structured markdown output, essential content extraction, and smart Q&A functionality. Its features streamline content analysis workflows and support flexible model selection.

    • 2
    • MCP
    • kimdonghwi94/web-analyzer-mcp
  • AgentQL MCP Server

    MCP-compliant server for structured web data extraction using AgentQL.

    AgentQL MCP Server acts as a Model Context Protocol (MCP) server that leverages AgentQL's data extraction capabilities to fetch structured information from web pages. It allows integration with applications supporting MCP, such as Claude Desktop, VS Code, and Cursor, by providing an accessible interface for extracting structured data based on user-defined prompts. With configurable API key support and streamlined installation, it simplifies the process of connecting web data extraction workflows to AI tools.

    • 120
    • MCP
    • tinyfish-io/agentql-mcp
  • Content Core

    AI-powered content extraction and processing platform with seamless model context integration.

    Content Core is an AI-driven platform for extracting, formatting, transcribing, and summarizing content from a wide variety of sources including documents, media files, web pages, images, and archives. It offers intelligent auto-detection and engine selection to optimize processing, and provides integrations via CLI, Python library, Raycast extension, macOS Services, and the Model Context Protocol (MCP). The platform supports context-aware AI summaries and direct integration with Claude through MCP for enhanced user workflows. Users can access zero-install options and benefit from enhanced processing capabilities such as advanced PDF parsing, OCR, and smart summarization.

    • 85
    • MCP
    • lfnovo/content-core