Site Crawler MCP
A Model Context Protocol (MCP) server that crawls websites and extracts assets such as images and SEO metadata. It is built for e-commerce sites and general web crawling needs.
Features
- Comprehensive website analysis: 12 different extraction modes for complete website insights
- Multi-mode crawling: Extract multiple data types in a single pass
- Smart extraction: Advanced pattern matching for accurate data extraction
- Performance optimized: Concurrent crawling with rate limiting
- Security analysis: HTTPS, security headers, SSL/TLS information
- SEO analysis: Complete SEO audit including meta tags, structured data, and more
- Legal compliance: KVKK, GDPR, privacy policy detection
- Business intelligence: Brand info, references, contact details extraction
Installation
From PyPI (when published)
pip install site-crawler-mcp
From Source (Development)
Using uv (Recommended)
# Clone the repository
git clone https://github.com/AndacGuven/site-crawler-mcp.git
cd site-crawler-mcp
# Create virtual environment with Python 3.12
uv venv --python 3.12
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies and package
uv sync
Using pip
# Clone the repository
git clone https://github.com/AndacGuven/site-crawler-mcp.git
cd site-crawler-mcp
# Create virtual environment (recommended)
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On Linux/Mac:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install package in development mode
pip install -e .
Usage
As an MCP Server
Add to your MCP configuration file:
- Windows: %APPDATA%\Claude\claude_desktop_config.json
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Linux: ~/.config/Claude/claude_desktop_config.json
Using uvx (Recommended)
{
"mcpServers": {
"site-crawler": {
"command": "uvx",
"args": ["--from", "/path/to/site-crawler-mcp", "site-crawler-mcp"]
}
}
}
Using uv run
{
"mcpServers": {
"site-crawler": {
"command": "uv",
"args": ["run", "site_crawler"],
"cwd": "/path/to/site-crawler-mcp"
}
}
}
Using python directly
{
"mcpServers": {
"site-crawler": {
"command": "python",
"args": ["-m", "site_crawler.server"],
"cwd": "/path/to/site-crawler-mcp/src",
"env": {
"PYTHONPATH": "/path/to/site-crawler-mcp/src"
}
}
}
}
Note: Replace /path/to/site-crawler-mcp with your actual project path. On Windows, include the drive letter and write each backslash as \\ so the JSON remains valid (e.g., C:\\Users\\YourName\\site-crawler-mcp).
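Hand-escaping Windows paths in JSON is easy to get wrong, so one option is to generate the entry with a script. The following is a minimal sketch that mirrors the uvx configuration shown earlier; the helper name `build_server_entry` and the example project path are assumptions made for illustration and are not part of this project.

```python
import json
from pathlib import PureWindowsPath


def build_server_entry(project_path: str) -> dict:
    """Return an mcpServers entry shaped like the uvx example (hypothetical helper)."""
    return {
        "mcpServers": {
            "site-crawler": {
                "command": "uvx",
                "args": ["--from", project_path, "site-crawler-mcp"],
            }
        }
    }


# Hypothetical Windows project location, used only for illustration.
# PureWindowsPath renders the path with backslashes on any operating system.
project_path = str(PureWindowsPath("C:/Users/YourName/site-crawler-mcp"))

# json.dumps escapes every backslash, so the output is valid JSON as written.
config_text = json.dumps(build_server_entry(project_path), indent=2)
print(config_text)
```

The printed text can be pasted into the configuration file, or merged into an existing `mcpServers` object if other servers are already configured.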
Available Tools
`site_crawlAssets`
Crawl a website and extract various assets based on specified modes.
Parameters:
- url (string, required): The URL to start crawling from
- modes (array, required): Array of extraction modes (see below)
- depth (number, optional): Crawling depth (default: 1)
- max_pages (number, optional): Maximum pages to crawl (default: 50)
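To illustrate how depth and max_pages can interact, here is a minimal sketch of a breadth-first crawl plan over a small in-memory link graph. It is not the server's actual implementation. The `LINKS` graph and the `plan_crawl` function are hypothetical, and the sketch assumes the start page counts as depth 0, which the server may define differently.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages and their links.
LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/team"],
    "/products": ["/products/1", "/products/2"],
    "/team": [],
    "/products/1": [],
    "/products/2": [],
}


def plan_crawl(start: str, depth: int = 1, max_pages: int = 50) -> list[str]:
    """Return a breadth-first visit order bounded by link depth and page count.

    Assumption: the start page is depth 0, so depth=1 also visits the pages
    it links to. The real server may count depth differently.
    """
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, level = queue.popleft()
        if level >= depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(page, []):
            if len(visited) >= max_pages:
                return visited  # page budget exhausted
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, level + 1))
    return visited


print(plan_crawl("/", depth=1))
print(plan_crawl("/", depth=2, max_pages=4))
```

Under these assumptions, raising depth widens the crawl by one link hop per level, while max_pages caps the total regardless of how deep the crawl is allowed to go.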
Available Modes:
- images: Extract all images with metadata (alt text, dimensions, format)
- meta: Basic SEO metadata (title, description, H1 tags)
- brand: Company branding information (logo, name, about pages)
- seo: Comprehensive SEO analysis (meta tags, structured data, open graph)
- performance: Page load metrics and performance indicators
- security: Security headers and HTTPS configuration
- compliance: Accessibility and regulatory compliance checks
- infrastructure: Server technology and CDN detection
- legal: Privacy policies, terms, KVKK compliance
- careers: Job opportunities and career pages
- references: Client testimonials and case studies
- contact: Contact information (email, phone, social media, address)
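Because modes is a free-form array, a client may want to catch misspelled mode names before sending a request. The sketch below builds a request payload in the same shape as the example requests in this README and validates it against the twelve modes listed above. The helper `build_crawl_request` is hypothetical, and how the server itself handles unknown modes is not documented here.

```python
import json

# The twelve extraction modes listed in this README.
VALID_MODES = {
    "images", "meta", "brand", "seo", "performance", "security",
    "compliance", "infrastructure", "legal", "careers", "references", "contact",
}


def build_crawl_request(url: str, modes: list[str],
                        depth: int = 1, max_pages: int = 50) -> dict:
    """Build a site_crawlAssets request payload (hypothetical client-side helper)."""
    if not modes:
        raise ValueError("at least one mode is required")
    unknown = sorted(set(modes) - VALID_MODES)
    if unknown:
        raise ValueError(f"unknown modes: {', '.join(unknown)}")
    return {
        "tool": "site_crawlAssets",
        "arguments": {
            "url": url,
            "modes": list(modes),
            "depth": depth,          # documented default: 1
            "max_pages": max_pages,  # documented default: 50
        },
    }


request = build_crawl_request("https://example.com", ["seo", "security"], depth=2)
print(json.dumps(request, indent=2))
```

Several modes can be combined in one call, which matches the multi-mode crawling described under Features.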
Example Requests:
- Basic image extraction:
{
"tool": "site_crawlAssets",
"arguments": {
"url": "https://example.com",
"modes": ["images"],
"depth": 1
}
}
- Full SEO and security audit:
{
"tool": "site_crawlAssets",
"arguments": {
"url": "https://example.com",
"modes": ["seo", "security", "performance"],
"depth": 2
}
}
- Business intelligence gathering:
{
"tool": "site_crawlAssets",
"arguments": {
"url": "https://example.com",
"modes": ["brand", "contact", "references", "careers"],
"depth": 3
}
}
- Legal compliance check:
{
"tool": "site_crawlAssets",
"arguments": {
"url": "https://example.com",
"modes": ["legal", "compliance"],
"depth": 2
}
}
Development
Requirements
- Python 3.10+
- BeautifulSoup4
- aiohttp
- MCP SDK
- uv (recommended for development)
Setup Development Environment
Using uv (Recommended)
# Clone the repository
git clone https://github.com/AndacGuven/site-