Summary
Follow-up to #805 (merged in #806). The AgentV Studio scaffold is functional with run list, run detail, eval detail (Steps/Output/Task), category breakdown, eval sidebar, and failure reason display. This issue tracks the remaining features needed for full parity with the convex-evals visualizer.
Research
- Convex-evals UX analysis — 512-line screen-by-screen breakdown with screenshots
- Convex-evals codebase analysis — architecture, data model, components
- AgentV Studio test report — browser-automated testing with 18 screenshots
- Tech stack comparison — convex-evals vs agentv stack decisions
- Convex-evals screenshots — reference screenshots for each screen
- Live reference: https://convex-evals.netlify.app/ — browse with agent-browser to see target UX
- Convex-evals source: cloned at `/home/christso/projects/convex-evals/` (visualizer in `visualizer/src/`)
Screenshot Map (target state for each gap)
| Gap | Reference Screenshot | What it shows |
|---|---|---|
| File tree (Output tab) | 06-eval-output-tab.png, 16-output-code-file.png | File tree left panel + Monaco right panel, .ts file with syntax highlighting |
| Category drill-down | 13-category-view.png | Category page with scoped stat cards and eval list |
| Experiments tab | 08-experiments-tab.png | Landing page Experiments tab with table |
| Targets/Models tab | 09-models-tab.png | Landing page Models tab with score bars |
| Experiment detail | 10-experiment-detail.png, 12-experiments-sidebar-detail.png | Experiment page with sidebar run list, stat cards, run table |
| Breadcrumbs | 05-eval-detail.png | Full breadcrumb trail at top: Home > Experiment > Run > Category > Eval |
Current State (what #806 shipped)
- Landing page with run list table (Tests Passing, Mean Score columns)
- Run detail with stat cards (Total, Passed, Failed, Pass Rate)
- Category breakdown section with score bars and filter
- Eval detail with 3-tab view (Steps, Output, Task)
- Context-aware eval sidebar with pass/fail indicators
- Failure reason display (red-tinted panel)
- Monaco Editor for output/task viewing
- Cyan→blue gradient score bars matching convex-evals aesthetic
- Dark theme (bg-gray-950)
- URL routing via TanStack Router
- `agentv studio` command (with `agentv serve` as hidden alias)
- Empty state handling ("No evaluations found")
Gap 1 (High): File Tree in Output/Task Tabs
What
Convex-evals Output tab shows a collapsible directory tree alongside Monaco Editor. Users click individual files to view them with syntax highlighting. Currently agentv dumps serialized conversation text into a single Monaco panel.
Where to change
- New component: `apps/studio/src/components/FileTree.tsx` — collapsible tree with folder/file icons
- Modify: `apps/studio/src/components/EvalDetail.tsx` — `OutputTab` and `TaskTab` functions
- New API endpoint: `GET /api/runs/:filename/evals/:evalId/files` in `apps/cli/src/commands/results/serve.ts` — returns file tree from the eval's artifact directory (grading/, timing/, input/, output/ folders)
- Reference: convex-evals `visualizer/src/lib/evalComponents.tsx` and screenshot `06-eval-output-tab.png`
Implementation
- Add Hono endpoint that reads the eval's run directory, lists files in `input/`, `output/`, `grading/`, `timing/` subdirectories, and returns a tree structure: `{ name, path, type: "file"|"dir", children? }`
- Create `FileTree` component: collapsible folders, file-type icons (use emoji like convex-evals: 📁 folder, 📘 .ts, 📋 .json, 📜 .log), click-to-select highlighting
- Split Output/Task tab into left panel (FileTree, ~250px) + right panel (MonacoViewer)
- On file click, fetch file content via `GET /api/runs/:filename/evals/:evalId/files/:path` and render in Monaco with language auto-detection from extension
- Default selection: first file in tree, or `run.log` if it exists
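The tree shape, language auto-detection, and default selection described above can be sketched as pure functions, independent of Hono and React. This is a minimal sketch, not the Studio implementation: `buildFileTree`, `languageForPath`, `pickDefaultFile`, and the extension table are hypothetical names, and the input is assumed to be a flat list of paths relative to the eval's artifact directory.

```typescript
// Node shape taken from the issue text: { name, path, type: "file"|"dir", children? }
type TreeNode = {
  name: string;
  path: string;
  type: "file" | "dir";
  children?: TreeNode[];
};

// Build a nested tree from flat relative paths such as "output/src/index.ts".
function buildFileTree(paths: string[]): TreeNode[] {
  const roots: TreeNode[] = [];
  for (const fullPath of [...paths].sort()) {
    const parts = fullPath.split("/").filter((p) => p.length > 0);
    let level = roots;
    parts.forEach((name, i) => {
      const path = parts.slice(0, i + 1).join("/");
      const isFile = i === parts.length - 1;
      let node = level.find((n) => n.name === name);
      if (!node) {
        node = isFile
          ? { name, path, type: "file" }
          : { name, path, type: "dir", children: [] };
        level.push(node);
      }
      if (node.type === "dir") {
        // Descend into the directory for the next path segment.
        level = node.children ?? (node.children = []);
      }
    });
  }
  return roots;
}

// Hypothetical extension table; values are Monaco language ids.
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  ts: "typescript",
  tsx: "typescript",
  js: "javascript",
  json: "json",
  md: "markdown",
};

function languageForPath(path: string): string {
  const dot = path.lastIndexOf(".");
  const ext = dot >= 0 ? path.slice(dot + 1).toLowerCase() : "";
  return LANGUAGE_BY_EXTENSION[ext] ?? "plaintext";
}

// Depth-first list of file nodes, in tree order.
function listFiles(nodes: TreeNode[]): TreeNode[] {
  const files: TreeNode[] = [];
  for (const node of nodes) {
    if (node.type === "file") files.push(node);
    else files.push(...listFiles(node.children ?? []));
  }
  return files;
}

// Default selection: run.log if present, otherwise the first file in the tree.
function pickDefaultFile(nodes: TreeNode[]): TreeNode | undefined {
  const files = listFiles(nodes);
  return files.find((f) => f.name === "run.log") ?? files[0];
}
```

In this sketch the endpoint handler would only need to collect relative paths from the artifact directory and pass them to `buildFileTree`; the `FileTree` component would call `languageForPath` when handing a selected file to Monaco.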
Red/Green Gates
- GREEN: Output tab renders a split layout with file tree on the left and Monaco on the right
- GREEN: Clicking a `.ts` file shows TypeScript syntax highlighting in Monaco
- GREEN: Clicking a `.json` file shows JSON syntax highlighting in Monaco
- GREEN: Folders are collapsible (click to expand/collapse)
- GREEN: File tree shows at least input and output artifact directories
- RED: Output tab shows raw serialized conversation text in a single panel (current state)
- RED: No file selection interaction exists
Gap 2 (High): Category Drill-Down Page
What
Convex-evals has a dedicated page per category (e.g., `/experiment/no_guidelines/run/abc/Fundamentals`) showing stat cards and eval list scoped to that category. Currently agentv only filters the flat list via category card clicks on the run detail page.
Where to change
- New route: `apps/studio/src/routes/runs/$runId.category.$category.tsx`
- Modify: `apps/studio/src/components/RunDetail.tsx` so that category cards link to the new route instead of toggling a filter
- Modify: `apps/studio/src/components/Sidebar.tsx` — on category pages, sidebar should show eval list for that category only
Implementation
- Create route file `runs/$runId.category.$category.tsx` that fetches run data and filters to evals matching the category
- Page shows: category name as heading, stat cards (Total, Passed, Failed, Pass Rate scoped to category), eval table
- Category cards on the run detail page become links to the category route instead of an `onClick` filter toggle
- Sidebar on category page shows evals in that category with pass/fail indicators
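The category-scoped stat cards can be derived with a small pure function that the new route and the sidebar could share. This is a sketch under assumptions: the `EvalResult` fields (`testId`, `category`, `passed`) are hypothetical and may not match the actual run data shape.

```typescript
// Hypothetical eval record; the real run data may use different field names.
type EvalResult = { testId: string; category: string; passed: boolean };

type StatCards = { total: number; passed: number; failed: number; passRate: number };

// Filter a run's evals to one category and compute the four stat-card values.
function categoryStats(
  evals: EvalResult[],
  category: string,
): { evals: EvalResult[]; stats: StatCards } {
  const scoped = evals.filter((e) => e.category === category);
  const total = scoped.length;
  const passed = scoped.filter((e) => e.passed).length;
  return {
    evals: scoped,
    stats: {
      total,
      passed,
      failed: total - passed,
      // Avoid NaN for an empty category; pass rate is a fraction in [0, 1].
      passRate: total === 0 ? 0 : passed / total,
    },
  };
}
```

The returned `evals` array is what both the eval table and the category-scoped sidebar would render, so the two stay consistent.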
Red/Green Gates
- GREEN: URL `/runs/{runId}/category/{categoryName}` renders a page with category-scoped stat cards and eval list
- GREEN: Clicking a category card on run detail navigates to `/runs/{runId}/category/{categoryName}`
- GREEN: Sidebar on category page shows only evals in that category
- GREEN: Browser back from category page returns to run detail
- RED: Category cards only toggle a client-side filter (current state)
- RED: No `/runs/{runId}/category/{categoryName}` URL exists
Gap 3 (Medium): Landing Page Tabs — Experiments & Targets
What
Convex-evals landing page has 3 tabs: Recent Runs (default), Experiments, Models. AgentV has only the run list.
Where to change
- Modify: `apps/studio/src/routes/index.tsx` — add tab bar and tab content components
- New components: `apps/studio/src/components/ExperimentsTab.tsx`, `apps/studio/src/components/TargetsTab.tsx`
- New API endpoints: `GET /api/experiments` and `GET /api/targets` in `serve.ts`, each aggregating across all runs
Implementation
- Add tab bar to landing page: "Recent Runs" | "Experiments" | "Targets" (use same tab styling as eval detail: cyan underline for active)
- Experiments tab: table with columns — Experiment, Runs, Targets, Evals (passed/total), Pass Rate (score bar), Last Run. Group data by `experiment` field across all runs.
- Targets tab: table with columns — Target, Runs, Experiments, Evals (passed/total), Pass Rate (score bar). Group data by `target` field across all runs.
- API endpoints aggregate from existing run index data (no new data sources needed)
- Rows in both tabs should be clickable, navigating to filtered views
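The grouping behind the Experiments tab might look like the following. It is a sketch only: the `RunSummary` fields are assumed, not taken from the actual run index, and `aggregateByExperiment` is a hypothetical helper name. The Targets tab would follow the same pattern with the `target` field as the grouping key.

```typescript
// Hypothetical per-run summary; field names are assumptions about the run index.
type RunSummary = {
  filename: string;
  experiment: string;
  target: string;
  passed: number;
  total: number;
  timestamp: string; // ISO 8601, so string comparison orders by time
};

type ExperimentRow = {
  experiment: string;
  runs: number;
  targets: number;
  passed: number;
  total: number;
  passRate: number;
  lastRun: string;
};

// Group runs by their `experiment` field and compute one table row per group.
function aggregateByExperiment(runs: RunSummary[]): ExperimentRow[] {
  const groups = new Map<string, RunSummary[]>();
  for (const run of runs) {
    const group = groups.get(run.experiment) ?? [];
    group.push(run);
    groups.set(run.experiment, group);
  }
  return Array.from(groups.entries()).map(([experiment, group]) => {
    const passed = group.reduce((sum, r) => sum + r.passed, 0);
    const total = group.reduce((sum, r) => sum + r.total, 0);
    return {
      experiment,
      runs: group.length,
      targets: new Set(group.map((r) => r.target)).size,
      passed,
      total,
      passRate: total === 0 ? 0 : passed / total,
      lastRun: group.reduce((latest, r) => (r.timestamp > latest ? r.timestamp : latest), ""),
    };
  });
}
```

Because the function only reads fields already present on each run, it fits the constraint that no new data sources are needed.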
Red/Green Gates
- GREEN: Landing page shows 3 tabs: "Recent Runs", "Experiments", "Targets"
- GREEN: Experiments tab shows a table with at least: experiment name, run count, pass rate bar
- GREEN: Targets tab shows a table with at least: target name, run count, pass rate bar
- GREEN: Active tab has cyan underline, inactive tabs are gray
- GREEN: Tab state persists on page refresh (via URL query param `?tab=experiments`)
- RED: Landing page has no tabs (current state)
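The tab-persistence gate depends on reading and writing the `?tab=` query parameter. A framework-independent sketch follows; the tab ids `runs` and `targets` and the helper names are assumptions (only `?tab=experiments` appears in the gate), and in the actual route this parsing would more likely live in TanStack Router's search-param validation.

```typescript
// Assumed tab ids; only "experiments" is named in the gate above.
const TABS = ["runs", "experiments", "targets"] as const;
type Tab = (typeof TABS)[number];

// Parse the active tab from a location search string, defaulting to "runs".
function tabFromSearch(search: string): Tab {
  const value = new URLSearchParams(search).get("tab");
  const match = TABS.find((t) => t === value);
  return match ?? "runs";
}

// Build the search string for a tab; the default tab keeps the URL clean.
function searchForTab(tab: Tab): string {
  return tab === "runs" ? "" : `?tab=${tab}`;
}
```

Unknown or missing values fall back to the default tab, so a stale or hand-edited URL still renders the run list.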
Gap 4 (Medium): Experiment Detail Page
What
Clicking an experiment in the Experiments tab should show a dedicated page with all runs in that experiment.
Where to change
- New route: `apps/studio/src/routes/experiments/$experimentName.tsx`
- Modify: `ExperimentsTab.tsx` — rows link to this new route
Implementation
- Create route that fetches all runs, filters to those matching the experiment name
- Page shows: experiment name as heading, stat cards (Total Runs, Completed, Pass Rate, Targets), run table (same columns as the landing page, scoped to the experiment)
- Sidebar shows experiment list with pass rate bars (similar to convex-evals experiment sidebar)
Red/Green Gates
- GREEN: URL `/experiments/{experimentName}` renders a page with experiment-scoped runs
- GREEN: Stat cards show experiment-level aggregates
- GREEN: Clicking a row navigates to run detail
- RED: No experiment detail page exists
Gap 5 (Medium): Breadcrumb Navigation
What
Convex-evals has a full breadcrumb trail: `Home > Experiment > Run > Category > Eval`. Currently we show simple "Run: X / Eval: Y" text.
Where to change
- New component: `apps/studio/src/components/Breadcrumbs.tsx`
- Modify: `apps/studio/src/components/Layout.tsx` — render breadcrumbs above page content
- Use TanStack Router's `useMatches()` or `useRouterState()` to derive breadcrumb segments from the current route
Implementation
- Create a `Breadcrumbs` component that reads the current route matches
- Each segment is a clickable link: Home (/) > Run (timestamp) > Category (name) > Eval (testId)
- Separator: `>` or `/` between segments
- Last segment is non-clickable (current page)
- Styling: gray-400 text, cyan for links, truncate long segments
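The segment logic above can be separated from the router so it is easy to test. In this sketch, `RouteMatchLike` is a hypothetical stand-in for whatever `useMatches()` returns after each match has been mapped to a pathname and a display label; the real match objects carry more fields.

```typescript
// Hypothetical simplified view of a route match: where it points and what to show.
type RouteMatchLike = { pathname: string; label: string };

// A crumb without `href` is the current page and renders as plain text.
type Crumb = { label: string; href?: string };

// Turn an ordered list of route matches into breadcrumb segments.
function toBreadcrumbs(matches: RouteMatchLike[]): Crumb[] {
  return matches.map((match, index) => {
    const isLast = index === matches.length - 1;
    return isLast ? { label: match.label } : { label: match.label, href: match.pathname };
  });
}

// Example: a category page under a run (labels and paths are illustrative).
const crumbs = toBreadcrumbs([
  { pathname: "/", label: "Home" },
  { pathname: "/runs/abc", label: "abc" },
  { pathname: "/runs/abc/category/Fundamentals", label: "Fundamentals" },
]);
```

Truncation of long segments is left to CSS here, since the full label is still useful for a tooltip or for screen readers.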
Red/Green Gates
- GREEN: Breadcrumb bar visible on all pages below the root
- GREEN: Each breadcrumb segment except the last is a clickable link
- GREEN: Clicking a breadcrumb navigates to that level
- GREEN: Breadcrumbs reflect the actual route hierarchy (not hardcoded)
- RED: Only "Run: X / Eval: Y" plain text shown (current state)
Low Priority (do last or skip)
Step timing badges
Add duration next to pass/fail checkmarks in assertion steps: "✓ Output contains 'Hello' (0.2s)". Check `durationMs` on assertion entries.
- GREEN: At least one step shows timing in parentheses
- RED: Steps show only checkmark + text
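A formatting helper for the badge could look like this. It is a sketch that assumes `durationMs` is an optional number on each assertion entry, as the description suggests; `formatStep` is a hypothetical name.

```typescript
// Render one assertion step, appending the duration in seconds when it is known.
function formatStep(text: string, passed: boolean, durationMs?: number): string {
  const mark = passed ? "✓" : "✗";
  const hasTiming = typeof durationMs === "number" && Number.isFinite(durationMs);
  const timing = hasTiming ? ` (${((durationMs as number) / 1000).toFixed(1)}s)` : "";
  return `${mark} ${text}${timing}`;
}

// formatStep("Output contains 'Hello'", true, 200) returns "✓ Output contains 'Hello' (0.2s)"
```

Steps without a recorded duration render exactly as they do today, so the badge degrades gracefully on older run data.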
Run metadata enrichment
Surface `target`, `experiment`, `eval_set` in run list and run detail headers.
- GREEN: Run list table has a "Target" or "Experiment" column
- RED: Run list only shows timestamp-based run IDs
Top navigation bar
Persistent top nav with "AgentV Studio" logo, breadcrumbs, and tab links.
- GREEN: Top bar is visible on all pages with logo and navigation
- RED: No top bar exists (sidebar only)
Pagination
"Load more" button or virtual scrolling for large result sets.
- GREEN: Run list with 50+ entries shows pagination or virtual scroll
- RED: All rows render at once regardless of count
Implementation Notes
- All studio code lives in `apps/studio/src/`
- Routes use TanStack Router file-based routing in `src/routes/`
- Data fetching uses TanStack Query hooks in `src/lib/api.ts`
- The Hono API in `apps/cli/src/commands/results/serve.ts` may need new endpoints
- Build: `bun --filter @agentv/studio build`
- Test: `bun --filter agentv test` (353 tests)
- Lint: `biome check apps/studio/`
- Dark theme uses Tailwind CSS 4 utilities (bg-gray-950, text-gray-100, etc.)
Verification Protocol
After implementing each gap, run `agentv studio` with test data (use `--dry-run-delay 100` to generate runs from examples/) and use agent-browser to screenshot each screen. Compare side-by-side with convex-evals reference screenshots in `research/findings/convex-evals/screenshots/`.
Non-Goals
- Run comparison / diff view (tracked separately in project: AgentV Studio — eval management platform with quality gates, orchestration, and analysis #788)
- Trend charts / historical analysis (tracked in project: AgentV Studio — eval management platform with quality gates, orchestration, and analysis #788 sub-issues)
- Quality gates, orchestration control (Phase 3 per project: AgentV Studio — eval management platform with quality gates, orchestration, and analysis #788)
- Mobile responsiveness
- Real-time SSE streaming (Phase 2)
Related
- feat: scaffold AgentV Studio with convex-evals dashboard feature parity #805 — Original scaffold issue (closed by feat(studio): scaffold AgentV Studio SPA with dashboard feature parity #806)
- project: AgentV Studio — eval management platform with quality gates, orchestration, and analysis #788 — Tracking: AgentV Studio eval management platform
- feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration #563 — Original platform issue