feat(studio): achieve full convex-evals feature parity

## Summary

Follow-up to #805 (merged in #806). The AgentV Studio scaffold is functional with run list, run detail, eval detail (Steps/Output/Task), category breakdown, eval sidebar, and failure reason display. This issue tracks the remaining features needed for full parity with the [convex-evals visualizer](https://convex-evals.netlify.app/).

## Research

- [Convex-evals UX analysis](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/ux-analysis.md) — 512-line screen-by-screen breakdown with screenshots
- [Convex-evals codebase analysis](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/codebase-analysis.md) — architecture, data model, components
- [AgentV Studio test report](https://github.com/agentevals/agentevals-research/blob/main/research/agentv/studio-test-report.md) — browser-automated testing with 18 screenshots
- [Tech stack comparison](https://github.com/agentevals/agentevals-research/blob/main/research/agentv/tech-stack-comparison.md) — convex-evals vs agentv stack decisions
- [Convex-evals screenshots](https://github.com/agentevals/agentevals-research/tree/main/research/findings/convex-evals/screenshots) — reference screenshots for each screen
- **Live reference**: https://convex-evals.netlify.app/ — browse with agent-browser to see target UX
- **Convex-evals source**: cloned at `/home/christso/projects/convex-evals/` (visualizer in `visualizer/src/`)

### Screenshot Map (target state for each gap)

| Gap | Reference Screenshot | What it shows |
|-----|---------------------|---------------|
| File tree (Output tab) | [`06-eval-output-tab.png`](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/screenshots/06-eval-output-tab.png), [`16-output-code-file.png`](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/screenshots/16-output-code-file.png) | File tree left panel + Monaco right panel, `.ts` file with syntax highlighting |
| Category drill-down | [`13-category-view.png`](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/screenshots/13-category-view.png) | Category page with scoped stat cards and eval list |
| Experiments tab | [`08-experiments-tab.png`](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/screenshots/08-experiments-tab.png) | Landing page Experiments tab with table |
| Targets/Models tab | [`09-models-tab.png`](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/screenshots/09-models-tab.png) | Landing page Models tab with score bars |
| Experiment detail | [`10-experiment-detail.png`](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/screenshots/10-experiment-detail.png), [`12-experiments-sidebar-detail.png`](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/screenshots/12-experiments-sidebar-detail.png) | Experiment page with sidebar run list, stat cards, run table |
| Breadcrumbs | [`05-eval-detail.png`](https://github.com/agentevals/agentevals-research/blob/main/research/findings/convex-evals/screenshots/05-eval-detail.png) | Full breadcrumb trail at top: Home > Experiment > Run > Category > Eval |

## Current State (what #806 shipped)

- [x] Landing page with run list table (Tests Passing, Mean Score columns)
- [x] Run detail with stat cards (Total, Passed, Failed, Pass Rate)
- [x] Category breakdown section with score bars and filter
- [x] Eval detail with 3-tab view (Steps, Output, Task)
- [x] Context-aware eval sidebar with pass/fail indicators
- [x] Failure reason display (red-tinted panel)
- [x] Monaco Editor for output/task viewing
- [x] Cyan→blue gradient score bars matching convex-evals aesthetic
- [x] Dark theme (bg-gray-950)
- [x] URL routing via TanStack Router
- [x] `agentv studio` command (with `agentv serve` as hidden alias)
- [x] Empty state handling ("No evaluations found")

---

## Gap 1 (High): File Tree in Output/Task Tabs

### What
The convex-evals Output tab shows a collapsible directory tree alongside Monaco Editor. Users click individual files to view them with syntax highlighting. Currently agentv dumps the serialized conversation text into a single Monaco panel.

### Where to change
- **New component**: `apps/studio/src/components/FileTree.tsx` (collapsible tree with folder/file icons)
- **Modify**: `apps/studio/src/components/EvalDetail.tsx` (`OutputTab` and `TaskTab` functions)
- **New API endpoint**: `GET /api/runs/:filename/evals/:evalId/files` in `apps/cli/src/commands/results/serve.ts`, which returns the file tree from the eval's artifact directory (`grading/`, `timing/`, `input/`, `output/` folders)
- **Reference**: convex-evals `visualizer/src/lib/evalComponents.tsx` and screenshot `06-eval-output-tab.png`

### Implementation
1. Add a Hono endpoint that reads the eval's run directory, lists files in the `input/`, `output/`, `grading/`, and `timing/` subdirectories, and returns a tree structure: `{ name, path, type: "file"|"dir", children? }`
2. Create a `FileTree` component: collapsible folders, file-type icons (use emoji like convex-evals: 📁 folder, 📘 .ts, 📋 .json, 📜 .log), click-to-select highlighting
3. Split the Output/Task tab into a left panel (FileTree, ~250px) and a right panel (MonacoViewer)
4. On file click, fetch the file content via `GET /api/runs/:filename/evals/:evalId/files/:path` and render it in Monaco with language auto-detection from the extension
5. Default selection: `run.log` if it exists, otherwise the first file in the tree
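
A minimal sketch of the tree-building and language-detection parts of the steps above. It assumes the endpoint has already collected relative file paths from the artifact directory; `buildFileTree`, `detectLanguage`, and `FileTreeNode` are hypothetical names, and the extension map is only a starting point. The node shape follows the `{ name, path, type, children? }` structure from step 1.

```typescript
// Hypothetical helpers for illustration; not part of the existing codebase.
interface FileTreeNode {
  name: string;
  path: string;
  type: "file" | "dir";
  children?: FileTreeNode[];
}

// Build a nested tree from flat relative paths such as "output/src/index.ts".
function buildFileTree(paths: string[]): FileTreeNode[] {
  const roots: FileTreeNode[] = [];
  for (const filePath of [...paths].sort()) {
    const parts = filePath.split("/").filter(Boolean);
    let level = roots;
    for (let i = 0; i < parts.length; i++) {
      const part = parts[i] ?? "";
      const isFile = i === parts.length - 1;
      let node = level.find((n) => n.name === part);
      if (!node) {
        node = {
          name: part,
          path: parts.slice(0, i + 1).join("/"),
          type: isFile ? "file" : "dir",
        };
        if (!isFile) node.children = [];
        level.push(node);
      }
      // Descend into the directory for the next path segment.
      if (!isFile) level = node.children ?? (node.children = []);
    }
  }
  return roots;
}

// Map a file extension to a Monaco language ID, falling back to plain text.
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  ts: "typescript",
  tsx: "typescript",
  js: "javascript",
  json: "json",
  md: "markdown",
  yaml: "yaml",
  yml: "yaml",
};

function detectLanguage(filePath: string): string {
  const ext = filePath.split(".").pop()?.toLowerCase() ?? "";
  return LANGUAGE_BY_EXTENSION[ext] ?? "plaintext";
}
```

Note that file paths contain slashes, so the `:path` parameter of the content route probably needs to match across segments; how that is expressed depends on the router configuration.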

### Red/Green Gates
- [ ] **GREEN**: Output tab renders a split layout with file tree on the left and Monaco on the right
- [ ] **GREEN**: Clicking a `.ts` file shows TypeScript syntax highlighting in Monaco
- [ ] **GREEN**: Clicking a `.json` file shows JSON syntax highlighting in Monaco
- [ ] **GREEN**: Folders are collapsible (click to expand/collapse)
- [ ] **GREEN**: File tree shows at least input and output artifact directories
- [ ] **RED**: Output tab shows raw serialized conversation text in a single panel (current state)
- [ ] **RED**: No file selection interaction exists

---

## Gap 2 (High): Category Drill-Down Page

### What
Convex-evals has a dedicated page per category (e.g., `/experiment/no_guidelines/run/abc/Fundamentals`) showing stat cards and eval list scoped to that category. Currently agentv only filters the flat list via category card clicks on the run detail page.

### Where to change
- **New route**: `apps/studio/src/routes/runs/$runId.category.$category.tsx`
- **Modify**: `apps/studio/src/components/RunDetail.tsx` (category cards should `<Link>` to the new route instead of toggling a filter)
- **Modify**: `apps/studio/src/components/Sidebar.tsx` (on category pages, the sidebar should show the eval list for that category only)

### Implementation
1. Create route file `runs/$runId.category.$category.tsx` that fetches run data and filters to evals matching the category
2. Page shows: category name as heading, stat cards (Total, Passed, Failed, Pass Rate scoped to category), eval table
3. Category cards on run detail page become `<Link to="/runs/$runId/category/$category">` instead of `onClick` filter toggle
4. Sidebar on category page shows evals in that category with pass/fail indicators
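
The category-scoped stat cards in step 2 could be computed along these lines. This is a sketch under assumptions: `EvalSummary` is a hypothetical minimal shape (the real eval result type may carry different field names), and `categoryStats` is an illustrative helper name.

```typescript
// Assumed minimal shape; the actual eval result type in agentv may differ.
interface EvalSummary {
  testId: string;
  category: string;
  passed: boolean;
}

interface CategoryStats {
  total: number;
  passed: number;
  failed: number;
  passRate: number; // fraction in [0, 1]
}

// Filter evals to one category and derive the four stat-card values.
function categoryStats(evals: EvalSummary[], category: string): CategoryStats {
  const scoped = evals.filter((e) => e.category === category);
  const total = scoped.length;
  const passed = scoped.filter((e) => e.passed).length;
  return {
    total,
    passed,
    failed: total - passed,
    // Avoid dividing by zero for an empty or unknown category.
    passRate: total === 0 ? 0 : passed / total,
  };
}
```

The same filtered list can feed both the eval table and the sidebar, so the page and sidebar stay consistent.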

### Red/Green Gates
- [ ] **GREEN**: URL `/runs/{runId}/category/{categoryName}` renders a page with category-scoped stat cards and eval list
- [ ] **GREEN**: Clicking a category card on run detail navigates to `/runs/{runId}/category/{categoryName}`
- [ ] **GREEN**: Sidebar on category page shows only evals in that category
- [ ] **GREEN**: Browser back from category page returns to run detail
- [ ] **RED**: Category cards only toggle a client-side filter (current state)
- [ ] **RED**: No `/runs/{runId}/category/{categoryName}` URL exists

---

## Gap 3 (Medium): Landing Page Tabs — Experiments & Targets

### What
The convex-evals landing page has 3 tabs: Recent Runs (default), Experiments, Models. AgentV has only the run list.

### Where to change
- **Modify**: `apps/studio/src/routes/index.tsx` (add tab bar and tab content components)
- **New components**: `apps/studio/src/components/ExperimentsTab.tsx`, `apps/studio/src/components/TargetsTab.tsx`
- **New API endpoints**: `GET /api/experiments` and `GET /api/targets` in `serve.ts`, aggregating across all runs

### Implementation
1. Add tab bar to landing page: "Recent Runs" | "Experiments" | "Targets" (use same tab styling as eval detail: cyan underline for active)
2. **Experiments tab**: table with columns: Experiment, Runs, Targets, Evals (passed/total), Pass Rate (score bar), Last Run. Group data by the `experiment` field across all runs.
3. **Targets tab**: table with columns: Target, Runs, Experiments, Evals (passed/total), Pass Rate (score bar). Group data by the `target` field across all runs.
4. API endpoints aggregate from existing run index data (no new data sources needed)
5. Rows in both tabs should be clickable, navigating to filtered views
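
One possible shape for the aggregation behind both endpoints, sketched under assumptions: `RunIndexEntry` is a hypothetical view of a run index record (only `experiment` and `target` are named in this issue; the other fields are illustrative), and `aggregateRuns` is an illustrative helper that serves both tabs by switching the grouping key.

```typescript
// Hypothetical run index record; real field names may differ.
interface RunIndexEntry {
  filename: string;
  experiment: string;
  target: string;
  passed: number;
  total: number;
  timestamp: string; // assumed ISO 8601 in UTC, so string order is chronological
}

interface AggregateRow {
  name: string;       // experiment or target name
  runs: number;
  related: string[];  // targets (when grouping by experiment) or experiments
  passed: number;
  total: number;
  passRate: number;   // fraction in [0, 1]
  lastRun: string;
}

function aggregateRuns(
  runs: RunIndexEntry[],
  groupBy: "experiment" | "target",
): AggregateRow[] {
  const other = groupBy === "experiment" ? "target" : "experiment";
  const groups = new Map<
    string,
    { runs: number; related: Set<string>; passed: number; total: number; lastRun: string }
  >();
  for (const run of runs) {
    const key = run[groupBy];
    const g =
      groups.get(key) ??
      { runs: 0, related: new Set<string>(), passed: 0, total: 0, lastRun: "" };
    g.runs += 1;
    g.related.add(run[other]);
    g.passed += run.passed;
    g.total += run.total;
    if (run.timestamp > g.lastRun) g.lastRun = run.timestamp;
    groups.set(key, g);
  }
  return Array.from(groups.entries()).map(([name, g]) => ({
    name,
    runs: g.runs,
    related: Array.from(g.related).sort(),
    passed: g.passed,
    total: g.total,
    passRate: g.total === 0 ? 0 : g.passed / g.total,
    lastRun: g.lastRun,
  }));
}
```

Calling `aggregateRuns(runs, "experiment")` would back `GET /api/experiments`, and `aggregateRuns(runs, "target")` would back `GET /api/targets`.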

### Red/Green Gates
- [ ] **GREEN**: Landing page shows 3 tabs: "Recent Runs", "Experiments", "Targets"
- [ ] **GREEN**: Experiments tab shows a table with at least: experiment name, run count, pass rate bar
- [ ] **GREEN**: Targets tab shows a table with at least: target name, run count, pass rate bar
- [ ] **GREEN**: Active tab has cyan underline, inactive tabs are gray
- [ ] **GREEN**: Tab state persists on page refresh (via URL query param `?tab=experiments`)
- [ ] **RED**: Landing page has no tabs (current state)
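
For the `?tab=` persistence gate, a small validator like the following could normalise the search param. This is a sketch: the tab IDs `runs` and `targets` are assumptions (only `experiments` appears in this issue), and `parseLandingTab` is a hypothetical name. A function of this shape could plausibly be passed as the route's `validateSearch` option in TanStack Router.

```typescript
// Assumed tab IDs; only "experiments" is named in this issue.
const LANDING_TABS = ["runs", "experiments", "targets"] as const;
type LandingTab = (typeof LANDING_TABS)[number];

// Accept only known tab IDs and fall back to the default "runs" tab.
function parseLandingTab(search: Record<string, unknown>): { tab: LandingTab } {
  const raw = search.tab;
  const tab = LANDING_TABS.find((t) => t === raw) ?? "runs";
  return { tab };
}
```

Unknown or missing values fall back to the Recent Runs tab, so a stale or hand-edited URL never renders an empty page.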

---

## Gap 4 (Medium): Experiment Detail Page

### What
Clicking an experiment in the Experiments tab should show a dedicated page with all runs in that experiment.

### Where to change
- **New route**: `apps/studio/src/routes/experiments/$experimentName.tsx`
- **Modify**: `ExperimentsTab.tsx` (rows link to this new route)

### Implementation
1. Create route that fetches all runs, filters to those matching the experiment name
2. Page shows: experiment name as heading, stat cards (Total Runs, Completed, Pass Rate, Targets), run table (same columns as landing but scoped)
3. Sidebar shows experiment list with pass rate bars (similar to convex-evals experiment sidebar)

### Red/Green Gates
- [ ] **GREEN**: URL `/experiments/{experimentName}` renders a page with experiment-scoped runs
- [ ] **GREEN**: Stat cards show experiment-level aggregates
- [ ] **GREEN**: Clicking a row navigates to run detail
- [ ] **RED**: No experiment detail page exists

---

## Gap 5 (Medium): Breadcrumb Navigation

### What
Convex-evals has a full breadcrumb trail: `Home > Experiment > Run > Category > Eval`. Currently we show simple "Run: X / Eval: Y" text.

### Where to change
- **New component**: `apps/studio/src/components/Breadcrumbs.tsx`
- **Modify**: `apps/studio/src/components/Layout.tsx` (render breadcrumbs above page content)
- Use TanStack Router's `useMatches()` or `useRouterState()` to derive breadcrumb segments from the current route

### Implementation
1. Create a `Breadcrumbs` component that reads the current route matches
2. Each segment is a clickable link: Home (/) > Run (timestamp) > Category (name) > Eval (testId)
3. Separator: `>` or `/` between segments
4. Last segment is non-clickable (current page)
5. Styling: gray-400 text, cyan for links, truncate long segments
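
The segment derivation can be prototyped as a pure function before wiring it to the router. The sketch below assumes routes follow a `/kind/value` pairing (as in `/runs/{runId}/category/{categoryName}`); `deriveBreadcrumbs` and `Crumb` are hypothetical names, and a real implementation would more likely read the matches from the router than parse the pathname.

```typescript
interface Crumb {
  label: string;
  href?: string; // omitted for the current page
}

// Derive breadcrumb segments from a pathname made of (kind, value) pairs,
// e.g. "/runs/abc/category/Fundamentals".
function deriveBreadcrumbs(pathname: string): Crumb[] {
  const segments = pathname.split("/").filter(Boolean);
  const crumbs: Crumb[] = [{ label: "Home", href: "/" }];
  for (let i = 0; i + 1 < segments.length; i += 2) {
    const href = "/" + segments.slice(0, i + 2).join("/");
    crumbs.push({ label: decodeURIComponent(segments[i + 1] ?? ""), href });
  }
  // The last crumb is the current page, so it is not a link.
  const last = crumbs[crumbs.length - 1];
  if (last) delete last.href;
  return crumbs;
}
```

Because the crumbs come from the path rather than a hardcoded list, new routes such as `/experiments/{experimentName}` get a trail without extra code.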

### Red/Green Gates
- [ ] **GREEN**: Breadcrumb bar visible on all pages below the root
- [ ] **GREEN**: Each breadcrumb segment except the last is a clickable link
- [ ] **GREEN**: Clicking a breadcrumb navigates to that level
- [ ] **GREEN**: Breadcrumbs reflect the actual route hierarchy (not hardcoded)
- [ ] **RED**: Only "Run: X / Eval: Y" plain text shown (current state)

---

## Low Priority (do last or skip)

### Step timing badges
Add duration next to pass/fail checkmarks in assertion steps: "✓ Output contains 'Hello' (0.2s)". Check `durationMs` on assertion entries.
- **GREEN**: At least one step shows timing in parentheses
- **RED**: Steps show only checkmark + text
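
A possible formatter for the badge, as a sketch: `formatStepLabel` is a hypothetical helper, and it assumes `durationMs` is a number of milliseconds that may be absent on some assertion entries.

```typescript
// Render a step label, appending the duration in seconds when it is known.
function formatStepLabel(text: string, passed: boolean, durationMs?: number): string {
  const mark = passed ? "✓" : "✗";
  const timing =
    durationMs === undefined ? "" : ` (${(durationMs / 1000).toFixed(1)}s)`;
  return `${mark} ${text}${timing}`;
}
```

Entries without `durationMs` render exactly as they do today, so the change is backward compatible with older run data.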

### Run metadata enrichment
Surface `target`, `experiment`, `eval_set` in run list and run detail headers.
- **GREEN**: Run list table has a "Target" or "Experiment" column
- **RED**: Run list only shows timestamp-based run IDs

### Top navigation bar
Persistent top nav with "AgentV Studio" logo, breadcrumbs, and tab links.
- **GREEN**: Top bar is visible on all pages with logo and navigation
- **RED**: No top bar exists (sidebar only)

### Pagination
"Load more" button or virtual scrolling for large result sets.
- **GREEN**: Run list with 50+ entries shows pagination or virtual scroll
- **RED**: All rows render at once regardless of count

---

## Implementation Notes

- All studio code lives in `apps/studio/src/`
- Routes use TanStack Router file-based routing in `src/routes/`
- Data fetching uses TanStack Query hooks in `src/lib/api.ts`
- The Hono API in `apps/cli/src/commands/results/serve.ts` may need new endpoints
- Build: `bun --filter @agentv/studio build`
- Test: `bun --filter agentv test` (353 tests)
- Lint: `biome check apps/studio/`
- Dark theme uses Tailwind CSS 4 utilities (bg-gray-950, text-gray-100, etc.)

## Verification Protocol

After implementing each gap, run `agentv studio` with test data (use `--dry-run-delay 100` to generate runs from `examples/`) and use agent-browser to screenshot each screen. Compare side-by-side with the convex-evals reference screenshots in `research/findings/convex-evals/screenshots/`.

## Non-Goals

- Run comparison / diff view (tracked separately in #788)
- Trend charts / historical analysis (tracked in #788 sub-issues)
- Quality gates, orchestration control (Phase 3 per #788)
- Mobile responsiveness
- Real-time SSE streaming (Phase 2)

## Related

- #805 — Original scaffold issue (closed by #806)
- #788 — Tracking: AgentV Studio eval management platform
- #563 — Original platform issue


