Labels: enhancement (New feature or request), wui (Relates to the browser dashboard / web UI runtime)
Objective
Add a dashboard view that helps users diagnose why tests fail, not just that they failed. It combines trace data, failure clustering, and git correlation to surface root causes and suggest fixes.
Architecture Boundary
external-first — dashboard analysis layer. Reads existing trace.jsonl and results.jsonl data. Does not modify the eval engine.
What this enables
Currently, debugging a failed eval requires manually reading trace files and comparing runs. The root cause explorer automates this:
- Failure clustering: Group similar failures across tests and runs by error pattern
- Trace filtering: Filter traces by tool, error type, latency, token usage
- Git correlation: Link score changes to specific commits
- Fix suggestions: Based on failure patterns, suggest prompt/logic adjustments
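The failure-clustering idea above can be sketched with simple string normalization: strip volatile details (quoted names, numbers, ids) from error messages so that structurally identical failures land in the same bucket. This is a minimal sketch; the `Failure` record shape and field names are assumptions, not the actual results.jsonl schema.

```typescript
// Hypothetical failure record; field names are illustrative,
// not the real results.jsonl schema.
interface Failure {
  testId: string;
  error: string;
}

// Normalize an error message into a cluster key by stripping
// volatile details: quoted names, hex ids, numbers.
function clusterKey(error: string): string {
  return error
    .toLowerCase()
    .replace(/'[^']*'|"[^"]*"/g, "<name>")
    .replace(/\b0x[0-9a-f]+\b/g, "<id>")
    .replace(/\d+(\.\d+)?/g, "<n>")
    .trim();
}

// Group failures by normalized key; clusters sorted by frequency.
function clusterFailures(failures: Failure[]): Map<string, Failure[]> {
  const clusters = new Map<string, Failure[]>();
  for (const f of failures) {
    const key = clusterKey(f.error);
    const bucket = clusters.get(key) ?? [];
    bucket.push(f);
    clusters.set(key, bucket);
  }
  return new Map(
    [...clusters.entries()].sort((a, b) => b[1].length - a[1].length),
  );
}
```

With this normalization, `Tool 'search' not found` and `Tool 'fetch' not found` collapse into one `tool <name> not found` cluster, which is exactly the grouping the cluster-detail view needs. Embedding- or LLM-based clustering (see Design Latitude) could replace `clusterKey` without changing the grouping step.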
Proposed views
Failure Overview
- Failure heatmap: tests × runs, colored by score (green/yellow/red)
- Top failure clusters with frequency and affected tests
- Score change timeline with git commit annotations
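The heatmap above is a straightforward mapping from flat run results to a tests × runs grid of color buckets. A minimal sketch, assuming per-run scores in [0, 1]; the 0.8/0.5 thresholds and the `RunResult` field names are illustrative choices, not fixed by this issue.

```typescript
type HeatColor = "green" | "yellow" | "red";

// Hypothetical flat result record; field names are assumptions.
interface RunResult {
  testId: string;
  runId: string;
  score: number;
}

// Map a score in [0, 1] to a color bucket. Thresholds are illustrative.
function heatColor(score: number): HeatColor {
  if (score >= 0.8) return "green";
  if (score >= 0.5) return "yellow";
  return "red";
}

// Build the tests × runs grid from flat results.
function heatmap(results: RunResult[]): Map<string, Map<string, HeatColor>> {
  const grid = new Map<string, Map<string, HeatColor>>();
  for (const r of results) {
    const row = grid.get(r.testId) ?? new Map<string, HeatColor>();
    row.set(r.runId, heatColor(r.score));
    grid.set(r.testId, row);
  }
  return grid;
}
```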
Failure Cluster Detail
- Similar failures grouped by error pattern (e.g., "tool not found", "timeout", "wrong format")
- Representative traces for each cluster
- Frequency trend: is this cluster growing or shrinking?
Trace Drill-Down
- Collapsible trace tree (extends the trace explorer from #563, "feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration")

- Side-by-side: passing trace vs failing trace for same test
- Highlight divergence point: where did the failing trace go wrong?
- Token/latency overlay: spot expensive or slow steps
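Finding the divergence point in the side-by-side comparison above reduces to a prefix walk over the two step sequences. A minimal sketch, assuming each trace step records a tool name and an outcome (the `Step` shape is hypothetical, not the real trace.jsonl schema):

```typescript
// Hypothetical trace step; the real trace.jsonl schema may differ.
interface Step {
  tool: string;
  status: "ok" | "error";
}

// First index where the failing trace diverges from the passing one:
// a different tool call or a different outcome. If the shared prefix
// matches entirely, the divergence is where the shorter trace ended
// (that length is returned); -1 means the traces are identical.
function divergencePoint(passing: Step[], failing: Step[]): number {
  const n = Math.min(passing.length, failing.length);
  for (let i = 0; i < n; i++) {
    if (
      passing[i].tool !== failing[i].tool ||
      passing[i].status !== failing[i].status
    ) {
      return i;
    }
  }
  return passing.length === failing.length ? -1 : n;
}
```

The dashboard can scroll both trace panes to this index and highlight it, which directly serves the "where did the failing trace go wrong?" question.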
Git Correlation
- Score timeline with commit markers
- Click commit → see which tests regressed
- Diff view: changed files that correlate with score drops
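The git correlation above can be sketched as: walk the score timeline, and for each drop beyond a threshold, attribute it to the most recent commit at or before that data point. The record shapes and the 0.05 drop threshold are assumptions for illustration.

```typescript
// Hypothetical timeline and commit records; shapes are assumptions.
interface ScorePoint {
  timestamp: number;
  meanScore: number;
}
interface Commit {
  hash: string;
  timestamp: number;
}

// Attribute each score drop to the latest commit at or before it.
// The drop threshold (default 0.05) is illustrative.
function suspectCommits(
  timeline: ScorePoint[],
  commits: Commit[],
  threshold = 0.05,
): { hash: string; delta: number }[] {
  const sorted = [...commits].sort((a, b) => a.timestamp - b.timestamp);
  const out: { hash: string; delta: number }[] = [];
  for (let i = 1; i < timeline.length; i++) {
    const delta = timeline[i].meanScore - timeline[i - 1].meanScore;
    if (delta <= -threshold) {
      // Latest commit at or before the dropped data point.
      const commit = sorted
        .filter((c) => c.timestamp <= timeline[i].timestamp)
        .pop();
      if (commit) out.push({ hash: commit.hash, delta });
    }
  }
  return out;
}
```

Clicking a commit marker in the timeline view would then run the inverse query: which tests' scores dropped between that commit and the previous run.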
Design Latitude
- Clustering algorithm (simple string matching, embedding-based, or LLM-assisted)
- Whether fix suggestions use LLM or pattern matching
- How to handle missing trace data (older runs without traces)
- Git integration depth (just commit hashes vs. full diff display)
Acceptance Signals
- Failures are clustered by error pattern across tests and runs
- Users can filter traces by tool, error type, latency
- Side-by-side trace comparison highlights divergence points
- Score timeline shows git commit correlation
- At least basic fix suggestions based on common failure patterns
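The last signal above (basic fix suggestions via pattern matching, the non-LLM option from Design Latitude) can be sketched as a regex-to-tip table. The patterns and suggestion texts below are illustrative examples, not a shipped rule set.

```typescript
// Illustrative pattern → suggestion table; entries are examples only.
const SUGGESTIONS: [RegExp, string][] = [
  [
    /tool .* not found/i,
    "Check that the tool is registered and the prompt uses its exact name.",
  ],
  [
    /timeout/i,
    "Raise the step timeout or split the task into smaller tool calls.",
  ],
  [
    /wrong format|parse error/i,
    "Tighten output-format instructions or add a schema example to the prompt.",
  ],
];

// Return the first matching suggestion, or undefined if no pattern applies.
function suggestFix(error: string): string | undefined {
  for (const [pattern, tip] of SUGGESTIONS) {
    if (pattern.test(error)) return tip;
  }
  return undefined;
}
```

Keeping the table data-driven means an LLM-assisted variant could later generate entries without changing the lookup path, in line with the suggest-only Non-Goal below.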
Non-Goals
- Automated fix application (suggest only)
- Custom clustering model training
- Integration with external error tracking (Sentry, etc.)
Dependencies
- #335 ("feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection") — provides the regression data this view analyzes
- #563 ("feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration") — the dashboard platform this view lives in
Research source
- melagiri/code-insights — pattern detection, friction point identification, root cause analysis across sessions