Web Scrapper Stdio MCP Server

A Python-based MCP server for robust, headless web scraping—extracts main text content from web pages and outputs Markdown, text, or HTML for seamless AI and automation integration.

Key Features

  • Headless browser scraping (Playwright, BeautifulSoup, Markdownify)
  • Outputs Markdown, text, or HTML
  • Designed for MCP (Model Context Protocol) stdio/JSON-RPC integration
  • Dockerized, with pre-built images
  • Configurable via environment variables
  • Robust error handling (timeouts, HTTP errors, Cloudflare, etc.)
  • Per-domain rate limiting
  • Easy integration with AI tools and IDEs (Cursor, Claude Desktop, Continue, JetBrains, Zed, etc.)
  • One-click install for Cursor, interactive installer for Claude
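
Per-domain rate limiting (one of the features above) can be sketched as follows. This is an illustrative model only, not the server's actual implementation; the class name and interval are assumptions:

```python
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Allow at most one request per `interval` seconds per domain (illustrative)."""

    def __init__(self, interval: float = 1.0):
        self.interval = interval
        self._last = {}  # domain -> monotonic timestamp of the last request

    def wait(self, url: str) -> None:
        """Block until it is safe to hit `url`'s domain again, then record the hit."""
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self._last.get(domain, float("-inf"))
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last[domain] = time.monotonic()
```

Requests to different domains proceed without delay; repeated requests to the same domain are spaced out by the interval.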

Quick Start

Run with Docker

docker run -i --rm ghcr.io/justazul/web-scrapper-stdio

Integration with AI Tools & IDEs

This service supports integration with a wide range of AI tools and IDEs that implement the Model Context Protocol (MCP). Below are ready-to-use configuration examples for the most popular environments. Replace the image/tag as needed for custom builds.

Cursor IDE

Add to your .cursor/mcp.json (project-level) or ~/.cursor/mcp.json (global):

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "ghcr.io/justazul/web-scrapper-stdio"
      ]
    }
  }
}

Claude Desktop

Add to your Claude Desktop MCP config (typically claude_desktop_config.json):

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "ghcr.io/justazul/web-scrapper-stdio"
      ]
    }
  }
}

Continue (VSCode/JetBrains Plugin)

Add to your continue.config.json or via the Continue plugin MCP settings:

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "ghcr.io/justazul/web-scrapper-stdio"
      ]
    }
  }
}

IntelliJ IDEA (JetBrains AI Assistant)

Go to Settings > Tools > AI Assistant > Model Context Protocol (MCP) and add a new server. Use:

{
  "command": "docker",
  "args": [
    "run",
    "-i",
    "--rm",
    "ghcr.io/justazul/web-scrapper-stdio"
  ]
}

Zed Editor

Add to your Zed MCP config (see Zed docs for the exact path):

{
  "mcpServers": {
    "web-scrapper-stdio": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "ghcr.io/justazul/web-scrapper-stdio"
      ]
    }
  }
}

Usage

MCP Server (Tool/Prompt)

This web scrapper is exposed as an MCP (Model Context Protocol) tool, so AI models and other automation can invoke it directly.

Tool: `scrape_web`

Parameters:

  • url (string, required): The URL to scrape
  • max_length (integer, optional): Maximum length of returned content (default: unlimited)
  • timeout_seconds (integer, optional): Timeout in seconds for the page load (default: 30)
  • user_agent (string, optional): Custom User-Agent string passed directly to the browser (defaults to a random agent)
  • wait_for_network_idle (boolean, optional): Wait for network activity to settle before scraping (default: true)
  • custom_elements_to_remove (list of strings, optional): Additional HTML elements (CSS selectors) to remove before extraction
  • grace_period_seconds (float, optional): Short grace period to allow JS to finish rendering (in seconds, default: 2.0)
  • output_format (string, optional): markdown, text, or html (default: markdown)
  • click_selector (string, optional): If provided, click the element matching this selector after navigation and before extraction

Returns:

  • Content extracted from the webpage in the requested output format (Markdown by default), as a string
  • Errors are reported as strings starting with [ERROR] ...
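
On the wire, an MCP client invokes this tool with a JSON-RPC `tools/call` request. A minimal sketch of building one in Python (the URL and argument values are illustrative; field names follow the MCP specification):

```python
import json

def make_scrape_request(request_id: int, url: str, **arguments) -> str:
    """Build a JSON-RPC 2.0 tools/call request for the scrape_web tool."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "scrape_web",
            "arguments": {"url": url, **arguments},
        },
    }
    return json.dumps(payload)

# Example: strip navigation/footer elements and allow a 60-second timeout.
request = make_scrape_request(
    1,
    "https://example.com",
    custom_elements_to_remove=["nav", "footer"],
    timeout_seconds=60,
    output_format="markdown",
)
print(request)
```

In practice an MCP client library (or the IDE integrations above) handles this framing for you; the sketch only shows what crosses the container's stdin.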


Try it

  • Scrape the content of https://example.com and provide it in Markdown format.
  • Extract the main text from this news article URL, but remove the navigation and footer elements first.
  • Go to this documentation page, click the 'Expand All' button using its CSS selector, and then scrape the full text.
  • Scrape this website and return the raw HTML content with a 60-second timeout.
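
Because failures come back as plain strings prefixed with [ERROR], automation that consumes the results can branch on a simple check. This helper is hypothetical, not part of the server:

```python
def is_scrape_error(result: str) -> bool:
    """Return True if a scrape_web result string reports an error."""
    return result.lstrip().startswith("[ERROR]")

print(is_scrape_error("[ERROR] Timeout after 30s"))  # True
print(is_scrape_error("# Example Domain\n\nSome text"))  # False
```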

Frequently Asked Questions

What are the key features of Web Scrapper Stdio?

  • Headless browser scraping using Playwright and BeautifulSoup
  • Output in Markdown, plain text, or HTML
  • Per-domain rate limiting and robust error handling for Cloudflare challenges and timeouts
  • Custom CSS selectors to remove elements, or click one, before extraction
  • Configurable grace periods for JavaScript rendering and network-idle waits

What can I use Web Scrapper Stdio for?

  • Feeding clean website content into LLMs for summarization or analysis
  • Automating data extraction from dynamic, JavaScript-heavy web applications
  • Converting documentation pages into Markdown for local knowledge bases
  • Retrieving text from pages behind basic bot detection

How do I install Web Scrapper Stdio?

Install Web Scrapper Stdio by running: docker run -i --rm ghcr.io/justazul/web-scrapper-stdio

What MCP clients work with Web Scrapper Stdio?

Web Scrapper Stdio works with any MCP-compatible client including Claude Desktop, Claude Code, Cursor, and other editors with MCP support.
