feat(studio): achieve full convex-evals feature parity #810

@christso

Summary

Follow-up to #805 (merged in #806). The AgentV Studio scaffold is functional with run list, run detail, eval detail (Steps/Output/Task), category breakdown, eval sidebar, and failure reason display. This issue tracks the remaining features needed for full parity with the convex-evals visualizer.

Research

Screenshot Map (target state for each gap)

| Gap | Reference Screenshot | What it shows |
| --- | --- | --- |
| File tree (Output tab) | 06-eval-output-tab.png, 16-output-code-file.png | File tree left panel + Monaco right panel, .ts file with syntax highlighting |
| Category drill-down | 13-category-view.png | Category page with scoped stat cards and eval list |
| Experiments tab | 08-experiments-tab.png | Landing page Experiments tab with table |
| Targets/Models tab | 09-models-tab.png | Landing page Models tab with score bars |
| Experiment detail | 10-experiment-detail.png, 12-experiments-sidebar-detail.png | Experiment page with sidebar run list, stat cards, run table |
| Breadcrumbs | 05-eval-detail.png | Full breadcrumb trail at top: Home > Experiment > Run > Category > Eval |

Current State (what #806 shipped)

  • Landing page with run list table (Tests Passing, Mean Score columns)
  • Run detail with stat cards (Total, Passed, Failed, Pass Rate)
  • Category breakdown section with score bars and filter
  • Eval detail with 3-tab view (Steps, Output, Task)
  • Context-aware eval sidebar with pass/fail indicators
  • Failure reason display (red-tinted panel)
  • Monaco Editor for output/task viewing
  • Cyan→blue gradient score bars matching convex-evals aesthetic
  • Dark theme (bg-gray-950)
  • URL routing via TanStack Router
  • `agentv studio` command (with `agentv serve` as hidden alias)
  • Empty state handling ("No evaluations found")

Gap 1 (High): File Tree in Output/Task Tabs

What

The convex-evals Output tab shows a collapsible directory tree alongside Monaco Editor. Users click individual files to view them with syntax highlighting. Currently AgentV dumps the serialized conversation text into a single Monaco panel.

Where to change

  • New component: `apps/studio/src/components/FileTree.tsx` — collapsible tree with folder/file icons
  • Modify: `apps/studio/src/components/EvalDetail.tsx` — `OutputTab` and `TaskTab` functions
  • New API endpoint: `GET /api/runs/:filename/evals/:evalId/files` in `apps/cli/src/commands/results/serve.ts` — returns file tree from the eval's artifact directory (grading/, timing/, input/, output/ folders)
  • Reference: convex-evals `visualizer/src/lib/evalComponents.tsx` and screenshot `06-eval-output-tab.png`

Implementation

  1. Add Hono endpoint that reads the eval's run directory, lists files in `input/`, `output/`, `grading/`, `timing/` subdirectories, and returns a tree structure: `{ name, path, type: "file"|"dir", children? }`
  2. Create `FileTree` component: collapsible folders, file-type icons (use emoji like convex-evals: 📁 folder, 📘 .ts, 📋 .json, 📜 .log), click-to-select highlighting
  3. Split Output/Task tab into left panel (FileTree, ~250px) + right panel (MonacoViewer)
  4. On file click, fetch file content via `GET /api/runs/:filename/evals/:evalId/files/:path` and render in Monaco with language auto-detection from extension
  5. Default selection: first file in tree, or `run.log` if it exists
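
A minimal sketch of steps 1 and 5, assuming the endpoint has already collected a flat list of relative file paths. The node shape follows the one given above; `buildFileTree` and `defaultSelection` are hypothetical helper names, not existing code.

```typescript
// Node shape from the issue: { name, path, type: "file"|"dir", children? }
interface TreeNode {
  name: string;
  path: string;
  type: "file" | "dir";
  children?: TreeNode[];
}

// Hypothetical helper: turns flat relative paths into a nested tree.
function buildFileTree(paths: string[]): TreeNode[] {
  const roots: TreeNode[] = [];
  for (const filePath of [...paths].sort()) {
    const parts = filePath.split("/").filter(Boolean);
    let level = roots;
    parts.forEach((name, i) => {
      const isFile = i === parts.length - 1;
      const path = parts.slice(0, i + 1).join("/");
      let node = level.find((n) => n.name === name);
      if (!node) {
        node = isFile
          ? { name, path, type: "file" }
          : { name, path, type: "dir", children: [] };
        level.push(node);
      }
      // Descend into the directory for the next path segment.
      if (node.children) level = node.children;
    });
  }
  return roots;
}

// Step 5: prefer run.log if present, otherwise the first file found depth-first.
function defaultSelection(tree: TreeNode[]): string | undefined {
  const files: string[] = [];
  const walk = (nodes: TreeNode[]): void => {
    for (const n of nodes) {
      if (n.type === "file") files.push(n.path);
      else walk(n.children ?? []);
    }
  };
  walk(tree);
  return files.find((p) => p === "run.log" || p.endsWith("/run.log")) ?? files[0];
}
```

Returning the tree as plain JSON keeps the `FileTree` component free of filesystem logic; it only needs to render nodes and report the selected `path`.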

Red/Green Gates

  • GREEN: Output tab renders a split layout with file tree on the left and Monaco on the right
  • GREEN: Clicking a `.ts` file shows TypeScript syntax highlighting in Monaco
  • GREEN: Clicking a `.json` file shows JSON syntax highlighting in Monaco
  • GREEN: Folders are collapsible (click to expand/collapse)
  • GREEN: File tree shows at least input and output artifact directories
  • RED: Output tab shows raw serialized conversation text in a single panel (current state)
  • RED: No file selection interaction exists
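
The syntax-highlighting gates depend on mapping a file extension to a Monaco language id. One possible sketch follows; `languageForPath` and the extension table are illustrative assumptions, and the real list would likely cover more extensions.

```typescript
// Monaco language ids for a few common extensions (assumed subset).
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  ts: "typescript",
  tsx: "typescript",
  js: "javascript",
  json: "json",
  md: "markdown",
  yaml: "yaml",
  yml: "yaml",
  log: "plaintext",
};

// Hypothetical helper: picks a language id from the file name, falling back to plaintext.
function languageForPath(filePath: string): string {
  const fileName = filePath.split("/").pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return "plaintext"; // no extension, or a dotfile such as ".env"
  const ext = fileName.slice(dot + 1).toLowerCase();
  return LANGUAGE_BY_EXTENSION[ext] ?? "plaintext";
}
```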

Gap 2 (High): Category Drill-Down Page

What

Convex-evals has a dedicated page per category (e.g., `/experiment/no_guidelines/run/abc/Fundamentals`) showing stat cards and an eval list scoped to that category. Currently AgentV only filters the flat eval list when a category card is clicked on the run detail page.

Where to change

  • New route: `apps/studio/src/routes/runs/$runId.category.$category.tsx`
  • Modify: `apps/studio/src/components/RunDetail.tsx` — category cards should link to the new route instead of toggling a filter
  • Modify: `apps/studio/src/components/Sidebar.tsx` — on category pages, sidebar should show eval list for that category only

Implementation

  1. Create route file `runs/$runId.category.$category.tsx` that fetches run data and filters to evals matching the category
  2. Page shows: category name as heading, stat cards (Total, Passed, Failed, Pass Rate scoped to category), eval table
  3. Category cards on the run detail page become links to the category route instead of an `onClick` filter toggle
  4. Sidebar on category page shows evals in that category with pass/fail indicators
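
The category-scoped stat cards in step 2 could be computed roughly like this. The `EvalResult` shape and the `categoryStats` name are assumptions for illustration; the real eval records may use different field names.

```typescript
// Assumed minimal shape of one eval result.
interface EvalResult {
  testId: string;
  category: string;
  passed: boolean;
}

interface CategoryStats {
  total: number;
  passed: number;
  failed: number;
  passRate: number; // fraction in [0, 1]
}

// Hypothetical helper: filters to one category and derives the four stat-card values.
function categoryStats(evals: EvalResult[], category: string): CategoryStats {
  const scoped = evals.filter((e) => e.category === category);
  const total = scoped.length;
  const passed = scoped.filter((e) => e.passed).length;
  return {
    total,
    passed,
    failed: total - passed,
    passRate: total === 0 ? 0 : passed / total, // avoid NaN for an empty category
  };
}
```

The same filtered list can feed both the eval table and the sidebar, so the page and sidebar cannot disagree about which evals belong to the category.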

Red/Green Gates

  • GREEN: URL `/runs/{runId}/category/{categoryName}` renders a page with category-scoped stat cards and eval list
  • GREEN: Clicking a category card on run detail navigates to `/runs/{runId}/category/{categoryName}`
  • GREEN: Sidebar on category page shows only evals in that category
  • GREEN: Browser back from category page returns to run detail
  • RED: Category cards only toggle a client-side filter (current state)
  • RED: No `/runs/{runId}/category/{categoryName}` URL exists

Gap 3 (Medium): Landing Page Tabs — Experiments & Targets

What

Convex-evals landing page has 3 tabs: Recent Runs (default), Experiments, Models. AgentV has only the run list.

Where to change

  • Modify: `apps/studio/src/routes/index.tsx` — add tab bar and tab content components
  • New components: `apps/studio/src/components/ExperimentsTab.tsx`, `apps/studio/src/components/TargetsTab.tsx`
  • New API endpoints: `GET /api/experiments` and `GET /api/targets` in serve.ts — aggregate across all runs

Implementation

  1. Add tab bar to landing page: "Recent Runs" | "Experiments" | "Targets" (use same tab styling as eval detail: cyan underline for active)
  2. Experiments tab: table with columns — Experiment, Runs, Targets, Evals (passed/total), Pass Rate (score bar), Last Run. Group data by `experiment` field across all runs.
  3. Targets tab: table with columns — Target, Runs, Experiments, Evals (passed/total), Pass Rate (score bar). Group data by `target` field across all runs.
  4. API endpoints aggregate from existing run index data (no new data sources needed)
  5. Rows in both tabs should be clickable, navigating to filtered views
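
A sketch of the grouping behind the Experiments tab (step 2), assuming each run index entry exposes the fields below. `RunSummary`, `ExperimentRow`, and `aggregateByExperiment` are hypothetical names. The Targets tab would be the same grouping keyed on `target`.

```typescript
// Assumed shape of one entry in the run index; field names are illustrative.
interface RunSummary {
  experiment: string;
  target: string;
  passed: number;
  total: number;
  timestamp: string; // ISO 8601, so string comparison orders runs chronologically
}

interface ExperimentRow {
  experiment: string;
  runs: number;
  targets: number;
  passed: number;
  total: number;
  passRate: number;
  lastRun: string;
}

function aggregateByExperiment(runs: RunSummary[]): ExperimentRow[] {
  // Group runs by their experiment field, preserving first-seen order.
  const groups = new Map<string, RunSummary[]>();
  for (const run of runs) {
    const group = groups.get(run.experiment) ?? [];
    group.push(run);
    groups.set(run.experiment, group);
  }
  return Array.from(groups.entries()).map(([experiment, group]) => {
    const passed = group.reduce((sum, r) => sum + r.passed, 0);
    const total = group.reduce((sum, r) => sum + r.total, 0);
    return {
      experiment,
      runs: group.length,
      targets: new Set(group.map((r) => r.target)).size,
      passed,
      total,
      passRate: total === 0 ? 0 : passed / total,
      lastRun: group.reduce((latest, r) => (r.timestamp > latest ? r.timestamp : latest), ""),
    };
  });
}
```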

Red/Green Gates

  • GREEN: Landing page shows 3 tabs: "Recent Runs", "Experiments", "Targets"
  • GREEN: Experiments tab shows a table with at least: experiment name, run count, pass rate bar
  • GREEN: Targets tab shows a table with at least: target name, run count, pass rate bar
  • GREEN: Active tab has cyan underline, inactive tabs are gray
  • GREEN: Tab state persists on page refresh (via URL query param `?tab=experiments`)
  • RED: Landing page has no tabs (current state)
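
The tab-persistence gate could be backed by a small parser like the one below. Only `?tab=experiments` appears in this issue; the `runs` and `targets` values and the `tabFromSearch` name are assumptions.

```typescript
type LandingTab = "runs" | "experiments" | "targets";

// Hypothetical helper: reads the active tab from a query string such as "?tab=experiments".
// Unknown or missing values fall back to the default "Recent Runs" tab.
function tabFromSearch(search: string): LandingTab {
  const tab = new URLSearchParams(search).get("tab");
  if (tab === "experiments") return "experiments";
  if (tab === "targets") return "targets";
  return "runs";
}
```

Falling back to the default tab for unrecognized values keeps stale or hand-edited URLs from rendering an empty page.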

Gap 4 (Medium): Experiment Detail Page

What

Clicking an experiment in the Experiments tab should show a dedicated page with all runs in that experiment.

Where to change

  • New route: `apps/studio/src/routes/experiments/$experimentName.tsx`
  • Modify: `ExperimentsTab.tsx` — rows link to this new route

Implementation

  1. Create route that fetches all runs, filters to those matching the experiment name
  2. Page shows: experiment name as heading, stat cards (Total Runs, Completed, Pass Rate, Targets), run table (same columns as landing but scoped)
  3. Sidebar shows experiment list with pass rate bars (similar to convex-evals experiment sidebar)

Red/Green Gates

  • GREEN: URL `/experiments/{experimentName}` renders a page with experiment-scoped runs
  • GREEN: Stat cards show experiment-level aggregates
  • GREEN: Clicking a row navigates to run detail
  • RED: No experiment detail page exists

Gap 5 (Medium): Breadcrumb Navigation

What

Convex-evals has a full breadcrumb trail: `Home > Experiment > Run > Category > Eval`. Currently AgentV shows only plain "Run: X / Eval: Y" text.

Where to change

  • New component: `apps/studio/src/components/Breadcrumbs.tsx`
  • Modify: `apps/studio/src/components/Layout.tsx` — render breadcrumbs above page content
  • Use TanStack Router's `useMatches()` or `useRouterState()` to derive breadcrumb segments from the current route

Implementation

  1. Create a `Breadcrumbs` component that reads the current route matches
  2. Each segment is a clickable link: Home (/) > Run (timestamp) > Category (name) > Eval (testId)
  3. Separator: `>` or `/` between segments
  4. Last segment is non-clickable (current page)
  5. Styling: gray-400 text, cyan for links, truncate long segments
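
A sketch of how segments might be derived once the route params are known. In the app these params would come from the router rather than being passed by hand. The `RouteParams` shape, the `buildCrumbs` name, and the `/runs/{runId}/evals/{evalId}` URL are assumptions; only the category URL pattern is taken from this issue.

```typescript
interface Crumb {
  label: string;
  href?: string; // omitted for the current page, which is not clickable
}

// Assumed params shape; any subset may be present depending on the route.
interface RouteParams {
  runId?: string;
  category?: string;
  evalId?: string;
}

function buildCrumbs(params: RouteParams): Crumb[] {
  const crumbs: Crumb[] = [{ label: "Home", href: "/" }];
  if (params.runId) {
    const runHref = `/runs/${encodeURIComponent(params.runId)}`;
    crumbs.push({ label: params.runId, href: runHref });
    if (params.category) {
      crumbs.push({
        label: params.category,
        href: `${runHref}/category/${encodeURIComponent(params.category)}`,
      });
    }
    if (params.evalId) {
      // Hypothetical eval URL; the real route path may differ.
      crumbs.push({
        label: params.evalId,
        href: `${runHref}/evals/${encodeURIComponent(params.evalId)}`,
      });
    }
  }
  // Step 4: the last segment is the current page, so drop its link.
  const last = crumbs[crumbs.length - 1];
  crumbs[crumbs.length - 1] = { label: last.label };
  return crumbs;
}
```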

Red/Green Gates

  • GREEN: Breadcrumb bar visible on all pages below the root
  • GREEN: Each breadcrumb segment except the last is a clickable link
  • GREEN: Clicking a breadcrumb navigates to that level
  • GREEN: Breadcrumbs reflect the actual route hierarchy (not hardcoded)
  • RED: Only "Run: X / Eval: Y" plain text shown (current state)

Low Priority (do last or skip)

Step timing badges

Add duration next to pass/fail checkmarks in assertion steps: "✓ Output contains 'Hello' (0.2s)". Check `durationMs` on assertion entries.

  • GREEN: At least one step shows timing in parentheses
  • RED: Steps show only checkmark + text
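
The badge format could be produced by a helper along these lines. `formatStepLabel` is a hypothetical name, and the "✗" mark for failed steps is an assumption; only `durationMs` and the "✓ ... (0.2s)" format come from the text above.

```typescript
// Hypothetical helper: appends a duration in seconds when durationMs is available.
function formatStepLabel(text: string, passed: boolean, durationMs?: number): string {
  const mark = passed ? "✓" : "✗";
  if (durationMs === undefined) return `${mark} ${text}`;
  return `${mark} ${text} (${(durationMs / 1000).toFixed(1)}s)`;
}
```

For example, `formatStepLabel("Output contains 'Hello'", true, 200)` yields `✓ Output contains 'Hello' (0.2s)`, and steps without a `durationMs` keep the current checkmark-plus-text form.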

Run metadata enrichment

Surface `target`, `experiment`, `eval_set` in run list and run detail headers.

  • GREEN: Run list table has a "Target" or "Experiment" column
  • RED: Run list only shows timestamp-based run IDs

Top navigation bar

Persistent top nav with "AgentV Studio" logo, breadcrumbs, and tab links.

  • GREEN: Top bar is visible on all pages with logo and navigation
  • RED: No top bar exists (sidebar only)

Pagination

"Load more" button or virtual scrolling for large result sets.

  • GREEN: Run list with 50+ entries shows pagination or virtual scroll
  • RED: All rows render at once regardless of count

Implementation Notes

  • All studio code lives in `apps/studio/src/`
  • Routes use TanStack Router file-based routing in `src/routes/`
  • Data fetching uses TanStack Query hooks in `src/lib/api.ts`
  • The Hono API in `apps/cli/src/commands/results/serve.ts` may need new endpoints
  • Build: `bun --filter @agentv/studio build`
  • Test: `bun --filter agentv test` (353 tests)
  • Lint: `biome check apps/studio/`
  • Dark theme uses Tailwind CSS 4 utilities (bg-gray-950, text-gray-100, etc.)

Verification Protocol

After implementing each gap, run `agentv studio` with test data (use `--dry-run-delay 100` to generate runs from examples/) and use agent-browser to screenshot each screen. Compare side-by-side with convex-evals reference screenshots in `research/findings/convex-evals/screenshots/`.

Non-Goals

Related
