Deep Research & Web Extraction module for AI agents
CortexScout (cortex-scout) — Search and Web Extraction Engine for AI Agents
CortexScout is the Deep Research & Web Extraction module within the Cortex-Works ecosystem.
It is designed for agent workloads that require token-efficient web retrieval, reliable anti-bot handling, and optional Human-in-the-Loop (HITL) fallback.
Overview
CortexScout provides a single, self-hostable Rust binary that exposes search, extraction, and stateful browser automation capabilities over MCP (stdio) and an optional HTTP server. Output formats are structured and optimized for downstream LLM use.
It is built to handle the practical failure modes of web retrieval (rate limits, bot challenges, JavaScript-heavy pages) through progressive fallbacks: native retrieval → Chromium CDP rendering → Stateful E2E Testing → HITL workflows.
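The progressive fallback chain described above can be sketched roughly as follows. This is a minimal illustration of the pattern, not CortexScout's implementation: `FetchResult`, `fetch_with_fallbacks`, and the two stub tiers are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FetchResult:
    ok: bool               # True when usable content was retrieved
    content: str = ""
    blocked: bool = False  # True when a bot challenge or rate limit was detected


# A tier is a (name, callable) pair. The callables are hypothetical
# stand-ins, not CortexScout functions.
Tier = tuple[str, Callable[[str], FetchResult]]


def fetch_with_fallbacks(url: str, tiers: list[Tier]) -> tuple[Optional[str], FetchResult]:
    """Try each tier in order and return the first one that succeeds."""
    last = FetchResult(ok=False)
    for name, fetch in tiers:
        last = fetch(url)
        if last.ok:
            return name, last
    # Every tier failed; the caller can inspect the last result.
    return None, last


# Stub tiers that simulate a blocked native fetch and a successful render.
def native_fetch(url: str) -> FetchResult:
    return FetchResult(ok=False, blocked=True)


def cdp_render(url: str) -> FetchResult:
    return FetchResult(ok=True, content="<html>rendered</html>")


tier_used, result = fetch_with_fallbacks(
    "https://example.com",
    [("native", native_fetch), ("cdp", cdp_render)],
)
```

In a real chain the later tiers (stateful automation, HITL) would be appended to the same list, so the cheapest strategy is always attempted first.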
Tools (Capability Roster)
| Area | MCP Tools / Capabilities |
|---|---|
| Search | web_search, web_search_json (parallel meta-search + dedup/scoring) |
| Fetch | web_fetch, web_fetch_batch (token-efficient clean output) |
| Crawl | web_crawl (bounded discovery for doc sites / sub-pages) |
| Extraction | extract_fields, fetch_then_extract (schema-driven extraction) |
| Automation | scout_browser_automate (stateful omni-tool), scout_agent_profile_auth (HITL portal), scout_browser_close |
| Anti-bot handling | CDP rendering, proxy rotation, block-aware retries |
| HITL | visual_scout, human_auth_session, non_robot_search |
| Memory | memory_search (LanceDB-backed research history) |
| Deep research | deep_research (multi-hop search + scrape + synthesis) |
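To make the "dedup/scoring" step in the Search row concrete, here is a minimal sketch of one common approach: normalize URLs, then sum reciprocal ranks across engines. CortexScout's actual scoring is not documented in this section, so the functions `normalize` and `merge_results` and the scoring formula are assumptions for illustration only.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Build a comparison key: lowercase host without "www.", plus the path
    without a trailing slash. Scheme, query, and fragment are dropped, which
    is a simplification that can merge distinct pages."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    return host + parts.path.rstrip("/")


def merge_results(per_engine: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Merge ranked URL lists; score = sum of reciprocal ranks across engines."""
    scores: dict[str, float] = {}
    first_seen: dict[str, str] = {}
    for urls in per_engine.values():
        for rank, url in enumerate(urls, start=1):
            key = normalize(url)
            scores[key] = scores.get(key, 0.0) + 1.0 / rank
            first_seen.setdefault(key, url)  # keep the first URL form seen
    ranked = sorted(scores, key=lambda k: scores[k], reverse=True)
    return [(first_seen[k], scores[k]) for k in ranked]


merged = merge_results({
    "engine_a": ["https://www.example.com/docs/", "https://other.org/x"],
    "engine_b": ["http://example.com/docs", "https://third.net/"],
})
```

A URL returned by several engines accumulates score from each, so agreement between engines pushes a result up the merged list.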
Ecosystem Integration
While CortexScout runs as a standalone tool today, it is designed to integrate with CortexDB and CortexStudio for multi-agent scaling, shared retrieval artifacts, and centralized governance.
🎭 The "Playwright Killer" (Stateful Browser Automation)
CortexScout includes a built-in, stateful CDP automation engine designed specifically for AI Agents, completely replacing heavy frameworks like Playwright or Cypress for E2E testing workflows.
- The Silent Omni-Tool (scout_browser_automate): Instead of calling dozens of tools, agents pass an array of steps (navigate, click, type, scroll, press_key, snapshot, screenshot). The entire sequence executes in a single LLM turn, saving massive amounts of context tokens.
- Persistent Agent Profile: Automation runs silently in the background (--headless=new) using a dedicated isolated profile (~/.cortex-scout/agent_profile). It maintains cookies, localStorage, and session state across tool calls without causing SingletonLock collisions with your active desktop browser.
- QA Mock & Assert Engine: Built for enterprise E2E testing. Agents can inject XHR/Fetch network interceptors (mock_api) and run fail-fast DOM assertions (assert) that immediately halt the sequence if a UI state is incorrect.
- The Agent Auth Portal (scout_agent_profile_auth): If the silent agent encounters a CAPTCHA or complex OAuth login (like Google/Microsoft) on a new domain, this tool launches the agent's profile in a visible window. You solve the CAPTCHA once, the cookies are saved, and the agent returns to silent automation forever.
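As a rough illustration of the single-call pattern, the sketch below builds an MCP `tools/call` request (JSON-RPC 2.0) that sends a whole step sequence to scout_browser_automate. The action names come from the list above; the per-step field names (`action`, `url`, `selector`, `text`, `url_pattern`, `response`, `contains`) and the helper `build_tool_call` are assumptions, since the tool's input schema is not shown here.

```python
import json

# Action names are taken from the text above; every field name inside a
# step is a guess for illustration, not the documented schema.
steps = [
    {"action": "navigate", "url": "https://example.com/login"},
    {"action": "mock_api", "url_pattern": "/api/profile", "response": {"name": "Test User"}},
    {"action": "type", "selector": "#email", "text": "agent@example.com"},
    {"action": "click", "selector": "button[type=submit]"},
    {"action": "assert", "selector": "h1", "contains": "Welcome"},
    {"action": "snapshot"},
]


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


request = build_tool_call(1, "scout_browser_automate", {"steps": steps})
```

Placing the assert step before the snapshot reflects the fail-fast behavior described above: if the assertion fails, the later steps would not run.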
Anti-Bot Efficacy & Validation
This repository includes captured evidence artifacts that validate extraction and HITL flows against representative protected targets.
| Target | Protection | Evidence | Notes |
|---|---|---|---|
| | Cloudflare + Auth | JSON · Snippet | Auth-gated listings extraction |
| Ticketmaster | Cloudflare Turnstile | JSON · Snippet | Challenge-handled extraction |
| Airbnb | DataDome | JSON · Snippet | Large result sets under bot controls |
| Upwork | reCAPTCHA | JSON · Snippet | Protected listings retrieval |
| Amazon | AWS Shield | JSON · Snippet | Search result extraction |
| nowsecure.nl | Cloudflare | [JSON](proof/nowsec |
Tools (5)
- web_search: Performs a parallel meta-search with deduplication and scoring.
- web_fetch: Fetches web content in a token-efficient clean output format.
- web_crawl: Performs bounded discovery for documentation sites or sub-pages.
- scout_browser_automate: Executes a sequence of browser steps in a single LLM turn.
- deep_research: Performs multi-hop search, scraping, and synthesis.
Environment Variables
- CORTEX_SCOUT_API_KEY: API key for premium search or proxy services if required.
Configuration
{"mcpServers": {"cortex-scout": {"command": "cortex-scout", "args": []}}}
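If the API key is needed, one way to supply it is through the server entry itself. The sketch below generates the configuration above with an optional `env` block. The `env` field is an assumption about the MCP client (many clients accept it, but check yours), and `cortex_scout_config` is a hypothetical helper name.

```python
import json
from typing import Optional


def cortex_scout_config(api_key: Optional[str] = None) -> dict:
    """Build the mcpServers entry, optionally passing the API key via env."""
    entry: dict = {"command": "cortex-scout", "args": []}
    if api_key:
        # Assumption: the MCP client forwards "env" to the server process.
        entry["env"] = {"CORTEX_SCOUT_API_KEY": api_key}
    return {"mcpServers": {"cortex-scout": entry}}


# Print a config that can be pasted into the client's settings file.
print(json.dumps(cortex_scout_config("your-key-here"), indent=2))
```

Calling `cortex_scout_config()` with no argument reproduces the minimal configuration shown above.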