Regression testing for AI agents.
Snapshot behavior, detect regressions, block broken agents before production.
EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.
✓ login-flow PASSED
⚠ refund-request TOOLS_CHANGED
- lookup_order → check_policy → process_refund
+ lookup_order → check_policy → process_refund → escalate_to_human
✗ billing-dispute REGRESSION -30 pts
Score: 85 → 55 Output similarity: 35%
Conventional tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder failure class: the agent returns 200 but silently takes the wrong tool path, skips a clarification step, or degrades output quality after a model update.
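The TOOLS_CHANGED result above is a sequence diff over recorded tool-call traces. Conceptually (this is a minimal sketch of the idea, not EvalView's actual implementation), detecting that kind of drift is a sequence comparison:

```python
# Conceptual sketch (not EvalView's code): flag when an agent's
# tool-call sequence drifts from its recorded baseline.
import difflib

def diff_tool_calls(baseline: list[str], current: list[str]) -> list[str]:
    """Return only the added/removed entries between two tool-call sequences."""
    return [
        line for line in difflib.ndiff(baseline, current)
        if line.startswith(("- ", "+ "))
    ]

baseline = ["lookup_order", "check_policy", "process_refund"]
current = ["lookup_order", "check_policy", "process_refund", "escalate_to_human"]

print(diff_tool_calls(baseline, current))  # → ['+ escalate_to_human']
```

An identical sequence produces an empty diff, so the test passes; any insertion, deletion, or reordering surfaces as a changed line, which is what gets flagged as TOOLS_CHANGED.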
Quick Start
pip install evalview
Already have a local agent running?
evalview init # Detect agent, create starter suite
evalview snapshot # Save current behavior as baseline
evalview check # Catch regressions after every change
No agent yet?
evalview demo # See regression detection live (~30 seconds, no API key)
Want a real working agent?
Starter repo: evalview-support-automation-template
An LLM-backed support automation agent with built-in EvalView regression tests.
git clone https://github.com/hidai25/evalview-support-automation-template
cd evalview-support-automation-template
make run
Other entry paths:
# Generate tests from a live agent
evalview generate --agent http://localhost:8000
# Capture real user flows via proxy
evalview capture --agent http://localhost:8000/invoke
# Capture a multi-turn conversation as one test
evalview capture --agent http://localhost:8000/invoke --multi-turn
# Generate from existing logs
evalview generate --from-log traffic.jsonl
How It Works
┌────────────┐       ┌──────────┐       ┌───────────────┐
│ Test Cases │  ──→  │ EvalView │  ──→  │  Your Agent   │
│   (YAML)   │       │          │  ←──  │ local / cloud │
└────────────┘       └──────────┘       └───────────────┘
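Per the diagram, test cases live in YAML. The actual schema is EvalView's own; conceptually, a case pairs a query with the tool sequence and output you expect, so field names in this sketch are illustrative only:

```yaml
# Illustrative only — field names are hypothetical, not EvalView's real schema.
name: refund-request
query: "I want a refund for order 123"
expect:
  tools: [lookup_order, check_policy, process_refund]
  output_contains: "refund"
```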
evalview init — detects your running agent, creates a starter test suite
evalview snapshot — runs tests, saves traces as baselines (picks judge model on first run)
evalview check — replays tests, diffs against baselines, opens HTML report with results
evalview monitor — runs checks continuously with optional Slack alerts
evalview snapshot list # See all saved baselines
evalview snapshot show "my-test" # Inspect a baseline
evalview snapshot delete "my-test" # Remove a baseline
evalview snapshot --reset # Clear all and start fresh
evalview replay # List tests, or: evalview replay "my-test"
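The snapshot/check cycle boils down to persisting a known-good trace, then diffing fresh runs against it. A conceptual sketch of that loop (the file layout, trace fields, and 10-point regression threshold are all assumptions for illustration, not EvalView internals):

```python
# Sketch of the snapshot/check cycle. File layout, trace fields, and the
# regression threshold are assumptions, not EvalView's implementation.
import json
import pathlib

BASELINE_DIR = pathlib.Path("baselines")  # hypothetical on-disk layout

def snapshot(name: str, trace: dict) -> None:
    """Persist the current trace as the known-good baseline."""
    BASELINE_DIR.mkdir(exist_ok=True)
    (BASELINE_DIR / f"{name}.json").write_text(json.dumps(trace))

def check(name: str, trace: dict) -> list[str]:
    """Diff a fresh trace against its baseline; return any problems found."""
    baseline = json.loads((BASELINE_DIR / f"{name}.json").read_text())
    problems = []
    if trace["tools"] != baseline["tools"]:
        problems.append("TOOLS_CHANGED")
    if trace["score"] < baseline["score"] - 10:  # drop threshold (assumed)
        problems.append("REGRESSION")
    return problems

snapshot("billing-dispute", {"tools": ["lookup_order"], "score": 85})
print(check("billing-dispute", {"tools": ["lookup_order"], "score": 55}))  # → ['REGRESSION']
```

A nonempty result from check is what would fail the gate; an empty list means behavior matches the baseline within tolerance.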
Your data stays local by default. Nothing leaves your machine unless you opt in to cloud sync via evalview login.
Two Modes, One CLI
EvalView has two complementary ways to test your agent:
Regression Gating — *"Did my agent change?"*
Snapshot known-good behavior, then detect when something drifts.
evalview snapshot # Capture current behavior as baseline
evalview check # Compare against baseline after every change
evalview check --judge opus # Use a specific judge model (sonnet, gpt-5.4, deepseek...)
evalview monitor # Continuous checks with Slack alerts
Evaluation — *"How good is my agent?"*
Auto-generate tests and score your agent.
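The "Output similarity: 35%" figure in the report above is a comparison between baseline and current outputs. A cheap lexical approximation of such a metric (a sketch only; EvalView's actual scoring, which involves a judge model, is not shown here):

```python
# Sketch: lexical output similarity between baseline and current responses.
# Not EvalView's actual metric — just a character-level approximation.
import difflib

def output_similarity(baseline: str, current: str) -> float:
    """Percent similarity between two output strings (lexical, not semantic)."""
    return round(100 * difflib.SequenceMatcher(None, baseline, current).ratio(), 1)

print(output_similarity("Refund approved for order 123",
                        "Refund approved for order 123"))  # → 100.0
```

A lexical ratio is fast and deterministic but blind to meaning, which is why quality scoring also leans on a judge model rather than string matching alone.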
Configuration
To register EvalView as an MCP server, add it to your MCP client's configuration:

{
  "mcpServers": {
    "evalview": {
      "command": "evalview",
      "args": ["mcp"]
    }
  }
}