Regression testing for AI agents.
Snapshot behavior, detect regressions, block broken agents before production.
EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.
✓ login-flow PASSED
⚠ refund-request TOOLS_CHANGED
- lookup_order → check_policy → process_refund
+ lookup_order → check_policy → process_refund → escalate_to_human
✗ billing-dispute REGRESSION -30 pts
Score: 85 → 55 Output similarity: 35%
Conventional tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder failure class: the agent returns 200 but silently takes the wrong tool path, skips a clarification step, or degrades output quality after a model update.
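The TOOLS_CHANGED result above is a sequence diff over recorded tool-call traces. Conceptually (this is a minimal sketch of the idea, not EvalView's actual implementation), detecting that kind of drift is a sequence comparison:

```python
# Conceptual sketch (not EvalView's code): flag when an agent's
# tool-call sequence drifts from its recorded baseline.
import difflib

def diff_tool_calls(baseline: list[str], current: list[str]) -> list[str]:
    """Return only the added/removed entries between two tool-call sequences."""
    return [
        line for line in difflib.ndiff(baseline, current)
        if line.startswith(("- ", "+ "))
    ]

baseline = ["lookup_order", "check_policy", "process_refund"]
current = ["lookup_order", "check_policy", "process_refund", "escalate_to_human"]

print(diff_tool_calls(baseline, current))  # → ['+ escalate_to_human']
```

An identical sequence produces an empty diff, so the test passes; any insertion, deletion, or reordering surfaces as a changed line, which is what gets flagged as TOOLS_CHANGED.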
Quick Start
pip install evalview
Already have a local agent running?
evalview init # Detect agent, create starter suite
evalview snapshot # Save current behavior as baseline
evalview check # Catch regressions after every change
No agent yet?
evalview demo # See regression detection live (~30 seconds, no API key)
Want a real working agent?
Starter repo: evalview-support-automation-template
An LLM-backed support automation agent with built-in EvalView regression tests.
git clone https://github.com/hidai25/evalview-support-automation-template
cd evalview-support-automation-template
make run
Other entry paths:
# Generate tests from a live agent
evalview generate --agent http://localhost:8000
# Capture real user flows via proxy
evalview capture --agent http://localhost:8000/invoke
# Capture a multi-turn conversation as one test
evalview capture --agent http://localhost:8000/invoke --multi-turn
# Generate from existing logs
evalview generate --from-log traffic.jsonl
How It Works
┌────────────┐       ┌──────────┐       ┌───────────────┐
│ Test Cases │  ──→  │ EvalView │  ──→  │  Your Agent   │
│   (YAML)   │       │          │  ←──  │ local / cloud │
└────────────┘       └──────────┘       └───────────────┘
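Per the diagram, test cases live in YAML. The actual schema is EvalView's own; conceptually, a case pairs a query with the tool sequence and output you expect, so field names in this sketch are illustrative only:

```yaml
# Illustrative only — field names are hypothetical, not EvalView's real schema.
name: refund-request
query: "I want a refund for order 123"
expect:
  tools: [lookup_order, check_policy, process_refund]
  output_contains: "refund"
```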
evalview init — detects your running agent, creates a starter test suite
evalview snapshot — runs tests, saves traces as baselines (picks judge model on first run)
evalview check — replays tests, diffs against baselines, opens HTML report with results
evalview monitor — runs checks continuously with optional Slack alerts
evalview snapshot list # See all saved baselines
evalview snapshot show "my-test" # Inspect a baseline
evalview snapshot delete "my-test" # Remove a baseline
evalview snapshot --reset # Clear all and start fresh
evalview replay # List tests, or: evalview replay "my-test"
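The snapshot/check cycle boils down to persisting a known-good trace, then diffing fresh runs against it. A conceptual sketch of that loop (the file layout, trace fields, and 10-point regression threshold are all assumptions for illustration, not EvalView internals):

```python
# Sketch of the snapshot/check cycle. File layout, trace fields, and the
# regression threshold are assumptions, not EvalView's implementation.
import json
import pathlib

BASELINE_DIR = pathlib.Path("baselines")  # hypothetical on-disk layout

def snapshot(name: str, trace: dict) -> None:
    """Persist the current trace as the known-good baseline."""
    BASELINE_DIR.mkdir(exist_ok=True)
    (BASELINE_DIR / f"{name}.json").write_text(json.dumps(trace))

def check(name: str, trace: dict) -> list[str]:
    """Diff a fresh trace against its baseline; return any problems found."""
    baseline = json.loads((BASELINE_DIR / f"{name}.json").read_text())
    problems = []
    if trace["tools"] != baseline["tools"]:
        problems.append("TOOLS_CHANGED")
    if trace["score"] < baseline["score"] - 10:  # drop threshold (assumed)
        problems.append("REGRESSION")
    return problems

snapshot("billing-dispute", {"tools": ["lookup_order"], "score": 85})
print(check("billing-dispute", {"tools": ["lookup_order"], "score": 55}))  # → ['REGRESSION']
```

A nonempty result from check is what would fail the gate; an empty list means behavior matches the baseline within tolerance.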
Your data stays local by default. Nothing leaves your machine unless you opt in to cloud sync via evalview login.
Two Modes, One CLI
EvalView has two complementary ways to test your agent:
Regression Gating — *"Did my agent change?"*
Snapshot known-good behavior, then detect when something drifts.
evalview snapshot # Capture current behavior as baseline
evalview check # Compare against baseline after every change
evalview check --judge opus # Use a specific judge model (sonnet, gpt-5.4, deepseek...)
evalview monitor # Continuous checks with Slack alerts
Evaluation — *"How good is my agent?"*
Auto-generate tests and score your agent.
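The "Output similarity: 35%" figure in the report above is a comparison between baseline and current outputs. A cheap lexical approximation of such a metric (a sketch only; EvalView's actual scoring, which involves a judge model, is not shown here):

```python
# Sketch: lexical output similarity between baseline and current responses.
# Not EvalView's actual metric — just a character-level approximation.
import difflib

def output_similarity(baseline: str, current: str) -> float:
    """Percent similarity between two output strings (lexical, not semantic)."""
    return round(100 * difflib.SequenceMatcher(None, baseline, current).ratio(), 1)

print(output_similarity("Refund approved for order 123",
                        "Refund approved for order 123"))  # → 100.0
```

A lexical ratio is fast and deterministic but blind to meaning, which is why quality scoring also leans on a judge model rather than string matching alone.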
Configuration
To register EvalView as an MCP server, add it to your MCP client's configuration:

{
  "mcpServers": {
    "evalview": {
      "command": "evalview",
      "args": ["mcp"]
    }
  }
}