EvalView MCP Server

Local setup required. Install the server on your machine before you register it in Claude Code.

1. Set the server up locally

Run this once to install the server before adding it to Claude Code.

Run in terminal
pip install evalview
2. Register it in Claude Code

After the local setup is done, run this command to point Claude Code at the installed server.

Run in terminal
claude mcp add evalview -- evalview mcp

This matches the claude_desktop_config.json entry in the Configuration section; make sure the evalview CLI installed in step 1 is on your PATH.

README.md

Regression testing for AI agents.

Snapshot behavior, detect regressions, block broken agents before production.


EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.

  ✓ login-flow           PASSED
  ⚠ refund-request       TOOLS_CHANGED
      - lookup_order → check_policy → process_refund
      + lookup_order → check_policy → process_refund → escalate_to_human
  ✗ billing-dispute      REGRESSION  -30 pts
      Score: 85 → 55  Output similarity: 35%
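Conceptually, the comparison above can be sketched as follows. This is illustrative Python, not EvalView's internal implementation, and the similarity measure here is a naive token-overlap stand-in for whatever scoring the tool actually uses:

```python
def diff_tool_path(baseline, current):
    """Classify a run by comparing its tool-call sequence to the baseline."""
    if baseline == current:
        return "PASSED"
    return "TOOLS_CHANGED"

def output_similarity(baseline_text, current_text):
    """Naive token-overlap (Jaccard) similarity between two outputs."""
    a, b = set(baseline_text.split()), set(current_text.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# The refund-request example from the report above:
baseline_path = ["lookup_order", "check_policy", "process_refund"]
current_path = baseline_path + ["escalate_to_human"]
print(diff_tool_path(baseline_path, current_path))  # TOOLS_CHANGED
```

The real diff is richer (parameters, ordering, cost, latency), but the core idea is the same: a recorded trace is structured data, so drift is a data comparison, not a log-reading exercise.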

Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class: the agent returns 200 but silently takes the wrong tool path, skips a clarification, or degrades output quality after a model update.

Quick Start

pip install evalview

Already have a local agent running?

evalview init        # Detect agent, create starter suite
evalview snapshot    # Save current behavior as baseline
evalview check       # Catch regressions after every change

No agent yet?

evalview demo        # See regression detection live (~30 seconds, no API key)

Want a real working agent?

Starter repo: evalview-support-automation-template
An LLM-backed support automation agent with built-in EvalView regression tests.

git clone https://github.com/hidai25/evalview-support-automation-template
cd evalview-support-automation-template
make run

Other entry paths:

# Generate tests from a live agent
evalview generate --agent http://localhost:8000

# Capture real user flows via proxy
evalview capture --agent http://localhost:8000/invoke

# Capture a multi-turn conversation as one test
evalview capture --agent http://localhost:8000/invoke --multi-turn

# Generate from existing logs
evalview generate --from-log traffic.jsonl

How It Works

┌─────────────┐      ┌──────────┐      ┌───────────────┐
│ Test Cases  │ ──→  │ EvalView │ ──→  │  Your Agent   │
│   (YAML)    │      │          │ ←──  │ local / cloud │
└─────────────┘      └──────────┘      └───────────────┘
  1. evalview init — detects your running agent, creates a starter test suite
  2. evalview snapshot — runs tests, saves traces as baselines (picks judge model on first run)
  3. evalview check — replays tests, diffs against baselines, opens HTML report with results
  4. evalview monitor — runs checks continuously with optional Slack alerts
evalview snapshot list              # See all saved baselines
evalview snapshot show "my-test"    # Inspect a baseline
evalview snapshot delete "my-test"  # Remove a baseline
evalview snapshot --reset           # Clear all and start fresh
evalview replay                     # List tests, or: evalview replay "my-test"

Your data stays local by default. Nothing leaves your machine unless you opt in to cloud sync via evalview login.

Two Modes, One CLI

EvalView has two complementary ways to test your agent:

Regression Gating — *"Did my agent change?"*

Snapshot known-good behavior, then detect when something drifts.

evalview snapshot              # Capture current behavior as baseline
evalview check                 # Compare against baseline after every change
evalview check --judge opus    # Use a specific judge model (sonnet, gpt-5.4, deepseek...)
evalview monitor               # Continuous checks with Slack alerts
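To gate a pipeline on the check result, a minimal sketch, assuming `evalview check` exits non-zero when a regression is detected (confirm this against your installed version):

```python
import subprocess

def regression_gate(cmd=("evalview", "check")):
    """Run the check command; True means no regression was flagged.

    Assumes a non-zero exit code signals a detected regression.
    """
    return subprocess.run(cmd).returncode == 0
```

In a shell-based pipeline the equivalent under the same assumption is simply `evalview check || exit 1`.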

Evaluation — *"How good is my agent?"*

Auto-generate tests and score your agent's responses.

Configuration

claude_desktop_config.json

{
  "mcpServers": {
    "evalview": {
      "command": "evalview",
      "args": ["mcp"]
    }
  }
}

Try it

Run the regression test suite for my current agent and report any tool path changes.
Snapshot the current behavior of my agent as a new golden baseline.
Compare the latest agent execution against the existing baseline and generate a report.
Monitor my agent for regressions and alert if performance drops below the threshold.

Frequently Asked Questions

What are the key features of EvalView?

Snapshot agent behavior to create golden baselines. Detect regressions in tool calls, parameters, and output sequences. Generate visual reports for multi-turn execution traces. Support for LangGraph, CrewAI, OpenAI, Claude, and custom HTTP APIs. Continuous monitoring with optional Slack alerts.

What can I use EvalView for?

Validating that model updates do not degrade agent performance or change tool usage patterns. Catching silent regressions where an agent returns a 200 status but takes the wrong logic path. Automating regression testing for support automation agents. Comparing agent outputs across different model versions or prompts.

How do I install EvalView?

Install EvalView by running: pip install evalview

What MCP clients work with EvalView?

EvalView works with any MCP-compatible client including Claude Desktop, Claude Code, Cursor, and other editors with MCP support.

Turn this server into reusable context

Keep EvalView docs, env vars, and workflow notes in Conare so your agent carries them across sessions.

Need the old visual installer? Open Conare IDE.