Summary
Add an optional category field to the eval YAML schema to enable hierarchical organization: Category > Dataset > Test ID. This introduces a higher-level grouping above datasets (eval files) for projects with many eval files.
Motivation
Real-world projects accumulate dozens of eval YAML files. Without a grouping mechanism above the dataset level, the Studio run detail page becomes a flat list. A category field provides lightweight, optional organization similar to how convex-evals uses directory-based categories (000-fundamentals, 001-data_modeling).
Proposed hierarchy
| Level | Source | Example |
|---|---|---|
| Category | `category` field in eval YAML (defaults to `"default"`) | "Fundamentals", "Advanced", "Regression" |
| Dataset | `name` field or filename of eval YAML (was `eval_set`) | "greeting-tests", "math-benchmark" |
| Test ID | `id` field on individual test | "test-greeting", "test-addition" |
Example YAML

    name: greeting-tests
    category: Fundamentals
    description: Basic greeting and politeness tests
    tests:
      - id: test-greeting
        criteria: Agent should greet the user
        # ...

If `category` is omitted, it defaults to `"default"`.
Objective
Add category as a suite-level field in eval YAML, propagate it through the pipeline to results, and add a two-level drill-down in Studio (Category > Dataset > Eval).
Design latitude
- Default category name: The implementer can choose `"default"`, `"Uncategorized"`, or `"General"`; pick whichever reads best in the UI.
- Studio UI grouping: The run detail page can use collapsible sections, a two-column layout, or a separate route per category. Collapsible sections are simplest.
- API shape: The implementer can either nest datasets inside the categories response or keep them as separate endpoints. Separate endpoints are preferred for consistency.
Key files to change
Schema & types (`packages/core/src/`)
- `evaluation/validation/eval-file.schema.ts`: add `category: z.string().optional()` at suite level
- `evaluation/types.ts`: add `category?: string` to `EvalTest` and `EvaluationResult`
- `evaluation/yaml-parser.ts`: read `category` from the suite object, default to `"default"`, and assign it to each test case (near line ~268, where `evalSetName` is extracted)
- `evaluation/orchestrator.ts`: pass `category` through to results
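The parser step above could look roughly like the following. This is a minimal sketch in plain TypeScript, not the actual `yaml-parser.ts` code: `RawSuite`, `ParsedTest`, and `assignCategory` are hypothetical names, the real schema uses zod, and treating a blank `category` as omitted is an assumption.

```typescript
// Hypothetical shapes for illustration; the real types live in evaluation/types.ts.
interface RawSuite {
  name?: string;
  category?: string;
  tests: { id: string; criteria?: string }[];
}

interface ParsedTest {
  id: string;
  criteria?: string;
  dataset: string;
  category: string;
}

const DEFAULT_CATEGORY = "default";

// Copy the suite-level category onto every test case.
// Assumption: a missing or blank category falls back to the default.
function assignCategory(suite: RawSuite, fallbackName: string): ParsedTest[] {
  const trimmed = suite.category?.trim();
  const category = trimmed ? trimmed : DEFAULT_CATEGORY;
  // Dataset comes from the name field, else the eval file's name.
  const dataset = suite.name ?? fallbackName;
  return suite.tests.map((t) => ({ ...t, dataset, category }));
}

const parsed = assignCategory(
  {
    name: "greeting-tests",
    category: "Fundamentals",
    tests: [{ id: "test-greeting", criteria: "Agent should greet the user" }],
  },
  "greeting-tests.yaml",
);
console.log(parsed.map((t) => `${t.category} > ${t.dataset} > ${t.id}`));
```

Resolving the default once in the parser means downstream code (orchestrator, artifact writer) can treat `category` as always present on freshly produced results.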
Artifact pipeline (`apps/cli/src/commands/`)
- `eval/artifact-writer.ts`: include `category` in index manifest entries
- `results/manifest.ts`: add `category` to `ResultManifestRecord`, hydrate into `EvaluationResult`
- `results/serve.ts`: new endpoint `GET /api/runs/:filename/categories/:category/datasets`; modify the existing categories endpoint to group by the new `category` field instead of `dataset`
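To illustrate the grouping the two endpoints would perform, here is a small sketch of the underlying logic, independent of any HTTP framework. The record shape (`testId`, `passed`) and the function names `listCategories` and `datasetsInCategory` are assumptions made for this example, not the existing `serve.ts` API.

```typescript
// Hypothetical result shape; field names other than category/dataset are assumed.
interface ResultRecord {
  testId: string;
  dataset: string;
  category: string;
  passed: boolean;
}

interface DatasetSummary {
  dataset: string;
  total: number;
  passed: number;
}

// Categories endpoint: distinct categories in first-seen order.
function listCategories(records: ResultRecord[]): string[] {
  return Array.from(new Set(records.map((r) => r.category)));
}

// Datasets-in-category endpoint: one summary per dataset in the given category.
function datasetsInCategory(records: ResultRecord[], category: string): DatasetSummary[] {
  const byDataset = new Map<string, DatasetSummary>();
  for (const r of records) {
    if (r.category !== category) continue;
    const summary = byDataset.get(r.dataset) ?? { dataset: r.dataset, total: 0, passed: 0 };
    summary.total += 1;
    if (r.passed) summary.passed += 1;
    byDataset.set(r.dataset, summary);
  }
  return Array.from(byDataset.values());
}

const sampleRecords: ResultRecord[] = [
  { testId: "test-greeting", dataset: "greeting-tests", category: "Fundamentals", passed: true },
  { testId: "test-farewell", dataset: "greeting-tests", category: "Fundamentals", passed: false },
  { testId: "test-addition", dataset: "math-benchmark", category: "Advanced", passed: true },
];

console.log(listCategories(sampleRecords));
console.log(datasetsInCategory(sampleRecords, "Fundamentals"));
```

Keeping the two lookups as separate functions mirrors the preference stated under "API shape" for separate endpoints over a nested response.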
Studio (`apps/studio/src/`), assumes #812 is merged first
- `lib/types.ts`: add `category` to `EvalResult`; new `CategoryWithDatasets` response type
- `lib/api.ts`: new `useCategoryDatasets(runId, category)` hook
- `components/RunDetail.tsx`: group by category first, then show dataset cards within each category section
- `routes/runs/$runId_.category.$category.tsx`: show datasets in that category (not individual evals)
- New route `routes/runs/$runId_.category.$category.dataset.$dataset.tsx`: show evals in that dataset
- `components/Breadcrumbs.tsx`: add dataset segment to breadcrumb trail
- `components/Sidebar.tsx`: update drill-down so the category sidebar shows datasets and the dataset sidebar shows evals
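The breadcrumb change can be sketched as a pure function that builds the trail for whichever level the user is on. `Crumb` and `buildBreadcrumbs` are hypothetical names, and the URL shape (`/runs/<runId>/category/<category>/dataset/<dataset>`) is an assumption inferred from the route file names above, not a confirmed path scheme.

```typescript
interface Crumb {
  label: string;
  // Omitted for the eval-level crumb, which is the current page.
  href?: string;
}

// Build Home > Run > Category > Dataset > Eval, stopping at the deepest level provided.
function buildBreadcrumbs(
  runId: string,
  category?: string,
  dataset?: string,
  evalId?: string,
): Crumb[] {
  const crumbs: Crumb[] = [{ label: "Home", href: "/" }];

  let href = `/runs/${encodeURIComponent(runId)}`;
  crumbs.push({ label: runId, href });
  if (category === undefined) return crumbs;

  href += `/category/${encodeURIComponent(category)}`;
  crumbs.push({ label: category, href });
  if (dataset === undefined) return crumbs;

  href += `/dataset/${encodeURIComponent(dataset)}`;
  crumbs.push({ label: dataset, href });
  if (evalId === undefined) return crumbs;

  crumbs.push({ label: evalId });
  return crumbs;
}

const trail = buildBreadcrumbs("run-1", "Fundamentals", "greeting-tests", "test-greeting");
console.log(trail.map((c) => c.label).join(" > "));
// prints "Home > run-1 > Fundamentals > greeting-tests > test-greeting"
```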
Acceptance signals
- `category: Fundamentals` in eval YAML appears in JSONL output and Studio
- Eval YAML without a `category` field defaults to the `"default"` category
- Studio run detail groups datasets under category headers
- Drill-down: click category → see datasets → click dataset → see evals
- Breadcrumbs show full path: Home > Run > Category > Dataset > Eval
- All existing tests pass (no regressions)
- Old JSONL files without a `category` field render under the default category
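The backward-compatibility signal for old JSONL files can be illustrated with a small hydration step that fills in the default when `category` is absent. This is a sketch under assumptions: the JSONL field names `testId` and `dataset`, and the helper name `hydrateJsonl`, are hypothetical and may not match the actual manifest format.

```typescript
interface HydratedResult {
  testId: string;
  dataset: string;
  category: string;
}

const FALLBACK_CATEGORY = "default";

// Parse JSONL text, defaulting category for records written before the field existed.
function hydrateJsonl(jsonl: string): HydratedResult[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const raw = JSON.parse(line) as Record<string, unknown>;
      const category = raw["category"];
      return {
        testId: String(raw["testId"]),
        dataset: String(raw["dataset"]),
        category:
          typeof category === "string" && category !== "" ? category : FALLBACK_CATEGORY,
      };
    });
}

const oldAndNew = [
  '{"testId":"test-addition","dataset":"math-benchmark"}',
  '{"testId":"test-greeting","dataset":"greeting-tests","category":"Fundamentals"}',
].join("\n");

console.log(hydrateJsonl(oldAndNew).map((r) => r.category));
// prints [ 'default', 'Fundamentals' ]
```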
Non-goals
- Nested categories (single level only)
- Auto-inferring category from directory structure
- Changing experiment or target semantics
Related
- refactor: rename eval_set to dataset across codebase #812: Rename `eval_set` to `dataset` (prerequisite; must be merged first)
- feat(studio): achieve full convex-evals feature parity #810: Studio feature parity (current implementation uses `eval_set` as the sole grouping level)