Labels: enhancement (New feature or request), wui (Relates to the browser dashboard / web UI runtime)
Objective
Add a dashboard view that helps users diagnose why tests fail, not just that they failed. It combines trace data, failure clustering, and git correlation to surface root causes and suggest fixes.
Architecture Boundary
external-first — dashboard analysis layer. Reads existing trace.jsonl and results.jsonl data. Does not modify the eval engine.
What this enables
Currently, debugging a failed eval requires manually reading trace files and comparing runs. The root cause explorer automates this:
- Failure clustering: Group similar failures across tests and runs by error pattern
- Trace filtering: Filter traces by tool, error type, latency, token usage
- Git correlation: Link score changes to specific commits
- Fix suggestions: Based on failure patterns, suggest prompt/logic adjustments
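The failure-clustering idea above can be sketched with simple string normalization: strip volatile details (quoted names, numbers, ids) from error messages so that structurally identical failures land in the same bucket. This is a minimal sketch; the `Failure` record shape and field names are assumptions, not the actual results.jsonl schema.

```typescript
// Hypothetical failure record; field names are illustrative,
// not the real results.jsonl schema.
interface Failure {
  testId: string;
  error: string;
}

// Normalize an error message into a cluster key by stripping
// volatile details: quoted names, hex ids, numbers.
function clusterKey(error: string): string {
  return error
    .toLowerCase()
    .replace(/'[^']*'|"[^"]*"/g, "<name>")
    .replace(/\b0x[0-9a-f]+\b/g, "<id>")
    .replace(/\d+(\.\d+)?/g, "<n>")
    .trim();
}

// Group failures by normalized key; clusters sorted by frequency.
function clusterFailures(failures: Failure[]): Map<string, Failure[]> {
  const clusters = new Map<string, Failure[]>();
  for (const f of failures) {
    const key = clusterKey(f.error);
    const bucket = clusters.get(key) ?? [];
    bucket.push(f);
    clusters.set(key, bucket);
  }
  return new Map(
    [...clusters.entries()].sort((a, b) => b[1].length - a[1].length),
  );
}
```

With this normalization, `Tool 'search' not found` and `Tool 'fetch' not found` collapse into one `tool <name> not found` cluster, which is exactly the grouping the cluster-detail view needs. Embedding- or LLM-based clustering (see Design Latitude) could replace `clusterKey` without changing the grouping step.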
Proposed views
Failure Overview
- Failure heatmap: tests × runs, colored by score (green/yellow/red)
- Top failure clusters with frequency and affected tests
- Score change timeline with git commit annotations
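The heatmap above is a straightforward mapping from flat run results to a tests × runs grid of color buckets. A minimal sketch, assuming per-run scores in [0, 1]; the 0.8/0.5 thresholds and the `RunResult` field names are illustrative choices, not fixed by this issue.

```typescript
type HeatColor = "green" | "yellow" | "red";

// Hypothetical flat result record; field names are assumptions.
interface RunResult {
  testId: string;
  runId: string;
  score: number;
}

// Map a score in [0, 1] to a color bucket. Thresholds are illustrative.
function heatColor(score: number): HeatColor {
  if (score >= 0.8) return "green";
  if (score >= 0.5) return "yellow";
  return "red";
}

// Build the tests × runs grid from flat results.
function heatmap(results: RunResult[]): Map<string, Map<string, HeatColor>> {
  const grid = new Map<string, Map<string, HeatColor>>();
  for (const r of results) {
    const row = grid.get(r.testId) ?? new Map<string, HeatColor>();
    row.set(r.runId, heatColor(r.score));
    grid.set(r.testId, row);
  }
  return grid;
}
```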
Failure Cluster Detail
- Similar failures grouped by error pattern (e.g., "tool not found", "timeout", "wrong format")
- Representative traces for each cluster
- Frequency trend: is this cluster growing or shrinking?
Trace Drill-Down
- Collapsible trace tree (extends the trace explorer from #563, "feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration")

- Side-by-side: passing trace vs failing trace for same test
- Highlight divergence point: where did the failing trace go wrong?
- Token/latency overlay: spot expensive or slow steps
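Finding the divergence point in the side-by-side comparison above reduces to a prefix walk over the two step sequences. A minimal sketch, assuming each trace step records a tool name and an outcome (the `Step` shape is hypothetical, not the real trace.jsonl schema):

```typescript
// Hypothetical trace step; the real trace.jsonl schema may differ.
interface Step {
  tool: string;
  status: "ok" | "error";
}

// First index where the failing trace diverges from the passing one:
// a different tool call or a different outcome. If the shared prefix
// matches entirely, the divergence is where the shorter trace ended
// (that length is returned); -1 means the traces are identical.
function divergencePoint(passing: Step[], failing: Step[]): number {
  const n = Math.min(passing.length, failing.length);
  for (let i = 0; i < n; i++) {
    if (
      passing[i].tool !== failing[i].tool ||
      passing[i].status !== failing[i].status
    ) {
      return i;
    }
  }
  return passing.length === failing.length ? -1 : n;
}
```

The dashboard can scroll both trace panes to this index and highlight it, which directly serves the "where did the failing trace go wrong?" question.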
Git Correlation
- Score timeline with commit markers
- Click commit → see which tests regressed
- Diff view: changed files that correlate with score drops
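The git correlation above can be sketched as: walk the score timeline, and for each drop beyond a threshold, attribute it to the most recent commit at or before that data point. The record shapes and the 0.05 drop threshold are assumptions for illustration.

```typescript
// Hypothetical timeline and commit records; shapes are assumptions.
interface ScorePoint {
  timestamp: number;
  meanScore: number;
}
interface Commit {
  hash: string;
  timestamp: number;
}

// Attribute each score drop to the latest commit at or before it.
// The drop threshold (default 0.05) is illustrative.
function suspectCommits(
  timeline: ScorePoint[],
  commits: Commit[],
  threshold = 0.05,
): { hash: string; delta: number }[] {
  const sorted = [...commits].sort((a, b) => a.timestamp - b.timestamp);
  const out: { hash: string; delta: number }[] = [];
  for (let i = 1; i < timeline.length; i++) {
    const delta = timeline[i].meanScore - timeline[i - 1].meanScore;
    if (delta <= -threshold) {
      // Latest commit at or before the dropped data point.
      const commit = sorted
        .filter((c) => c.timestamp <= timeline[i].timestamp)
        .pop();
      if (commit) out.push({ hash: commit.hash, delta });
    }
  }
  return out;
}
```

Clicking a commit marker in the timeline view would then run the inverse query: which tests' scores dropped between that commit and the previous run.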
Design Latitude
- Clustering algorithm (simple string matching, embedding-based, or LLM-assisted)
- Whether fix suggestions use LLM or pattern matching
- How to handle missing trace data (older runs without traces)
- Git integration depth (just commit hashes vs. full diff display)
Acceptance Signals
- Failures are clustered by error pattern across tests and runs
- Users can filter traces by tool, error type, latency
- Side-by-side trace comparison highlights divergence points
- Score timeline shows git commit correlation
- At least basic fix suggestions based on common failure patterns
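The last signal above (basic fix suggestions via pattern matching, the non-LLM option from Design Latitude) can be sketched as a regex-to-tip table. The patterns and suggestion texts below are illustrative examples, not a shipped rule set.

```typescript
// Illustrative pattern → suggestion table; entries are examples only.
const SUGGESTIONS: [RegExp, string][] = [
  [
    /tool .* not found/i,
    "Check that the tool is registered and the prompt uses its exact name.",
  ],
  [
    /timeout/i,
    "Raise the step timeout or split the task into smaller tool calls.",
  ],
  [
    /wrong format|parse error/i,
    "Tighten output-format instructions or add a schema example to the prompt.",
  ],
];

// Return the first matching suggestion, or undefined if no pattern applies.
function suggestFix(error: string): string | undefined {
  for (const [pattern, tip] of SUGGESTIONS) {
    if (pattern.test(error)) return tip;
  }
  return undefined;
}
```

Keeping the table data-driven means an LLM-assisted variant could later generate entries without changing the lookup path, in line with the suggest-only Non-Goal below.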
Non-Goals
- Automated fix application (suggest only)
- Custom clustering model training
- Integration with external error tracking (Sentry, etc.)
Dependencies
- #335 ("feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection") — provides the regression data this view analyzes
- #563 ("feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration") — the dashboard platform this view lives in
Research source
- melagiri/code-insights — pattern detection, friction point identification, root cause analysis across sessions