feat(eval): add category field to eval YAML for hierarchical grouping #813

@christso

Summary

Add an optional category field to the eval YAML schema to enable hierarchical organization: Category > Dataset > Test ID. This introduces a higher-level grouping above datasets (eval files) for projects with many eval files.

Motivation

Real-world projects accumulate dozens of eval YAML files. Without a grouping mechanism above the dataset level, the Studio run detail page becomes a flat list. A category field provides lightweight, optional organization similar to how convex-evals uses directory-based categories (000-fundamentals, 001-data_modeling).

Proposed hierarchy

| Level | Source | Example |
| --- | --- | --- |
| Category | category field in eval YAML (defaults to "default") | "Fundamentals", "Advanced", "Regression" |
| Dataset | name field or filename of eval YAML (was eval_set) | "greeting-tests", "math-benchmark" |
| Test ID | id field on individual test | "test-greeting", "test-addition" |

Example YAML

name: greeting-tests
category: Fundamentals
description: Basic greeting and politeness tests

tests:
  - id: test-greeting
    criteria: Agent should greet the user
    # ...

If category is omitted, it defaults to "default".

Objective

Add category as a suite-level field in eval YAML, propagate it through the pipeline to results, and add a two-level drill-down in Studio (Category > Dataset > Eval).

Design latitude

  • Default category name: "default" is used throughout this issue, but the implementer may choose "Uncategorized" or "General" instead; pick whichever reads best in the UI.
  • Studio UI grouping: The run detail page can use collapsible sections, a two-column layout, or a separate route per category. Collapsible sections are simplest.
  • API shape: The implementer can either nest datasets inside the categories response or keep them as separate endpoints. Separate endpoints are preferred for consistency.

Key files to change

Schema & types (packages/core/src/)

  • evaluation/validation/eval-file.schema.ts — add category: z.string().optional() at suite level
  • evaluation/types.ts — add category?: string to EvalTest and EvaluationResult
  • evaluation/yaml-parser.ts — read category from suite object, default to "default", assign to each test case (near line ~268 where evalSetName is extracted)
  • evaluation/orchestrator.ts — pass category through to results
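
The parser change above could look roughly like the following. This is a minimal sketch, not the actual yaml-parser.ts code: RawSuite, RawTest, toEvalTests, and DEFAULT_CATEGORY are hypothetical names, the types are cut down to the fields this issue mentions, and treating a blank category as omitted is an assumption.

```typescript
// Hypothetical, simplified shapes; the real types in evaluation/types.ts carry more fields.
interface RawTest {
  id: string;
  criteria?: string;
}

interface RawSuite {
  name?: string;
  category?: string;
  tests: RawTest[];
}

interface EvalTest extends RawTest {
  dataset: string;
  category: string;
}

// Assumed default label; the issue leaves the final wording to the implementer.
const DEFAULT_CATEGORY = "default";

// Resolve the suite-level category once, then stamp it on every test case.
// Treating a blank string as "omitted" is an assumption, not part of the issue.
function toEvalTests(suite: RawSuite, fileName: string): EvalTest[] {
  const dataset = suite.name ?? fileName;
  const category = suite.category?.trim() || DEFAULT_CATEGORY;
  return suite.tests.map((test) => ({ ...test, dataset, category }));
}
```

Resolving the default in the parser means every downstream consumer (orchestrator, artifact writer, Studio) always sees a concrete string, even though the schema field itself stays optional.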

Artifact pipeline (apps/cli/src/commands/)

  • eval/artifact-writer.ts — include category in index manifest entries
  • results/manifest.ts — add category to ResultManifestRecord, hydrate into EvaluationResult
  • results/serve.ts — new endpoint: GET /api/runs/:filename/categories/:category/datasets; modify existing categories endpoint to group by the new category field instead of dataset
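
For the two endpoints in results/serve.ts, the grouping logic might be sketched as below. The record and response shapes (ManifestRecord, CategorySummary) and both function names are hypothetical. Only the idea of grouping by category first and then listing datasets within one category comes from this issue.

```typescript
// Hypothetical minimal manifest record; the real ResultManifestRecord has more fields.
interface ManifestRecord {
  testId: string;
  dataset: string;
  category?: string;
}

interface CategorySummary {
  category: string;
  datasets: string[];
  total: number;
}

// Assumed fallback for records written before the category field existed.
const FALLBACK_CATEGORY = "default";

// Data for the modified categories endpoint: one entry per category.
function groupByCategory(records: ManifestRecord[]): CategorySummary[] {
  const byCategory = new Map<string, { datasets: Set<string>; total: number }>();
  for (const record of records) {
    const key = record.category ?? FALLBACK_CATEGORY;
    let entry = byCategory.get(key);
    if (entry === undefined) {
      entry = { datasets: new Set<string>(), total: 0 };
      byCategory.set(key, entry);
    }
    entry.datasets.add(record.dataset);
    entry.total += 1;
  }
  return Array.from(byCategory.entries()).map(([category, entry]) => ({
    category,
    datasets: Array.from(entry.datasets).sort(),
    total: entry.total,
  }));
}

// Data for GET /api/runs/:filename/categories/:category/datasets.
function datasetsInCategory(records: ManifestRecord[], category: string): string[] {
  const match = groupByCategory(records).find((c) => c.category === category);
  return match ? match.datasets : [];
}
```

Keeping the dataset list behind its own function mirrors the preference for separate endpoints stated under Design latitude.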

Studio (apps/studio/src/) — assumes #812 is merged first

  • lib/types.ts — add category to EvalResult; new CategoryWithDatasets response type
  • lib/api.ts — new useCategoryDatasets(runId, category) hook
  • components/RunDetail.tsx — group by category first, then show dataset cards within each category section
  • routes/runs/$runId_.category.$category.tsx — show datasets in that category (not individual evals)
  • New route: routes/runs/$runId_.category.$category.dataset.$dataset.tsx — show evals in that dataset
  • components/Breadcrumbs.tsx — add dataset segment to breadcrumb trail
  • components/Sidebar.tsx — update drill-down: category sidebar shows datasets, dataset sidebar shows evals
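
One way the breadcrumb trail could be derived from the route parameters is sketched here. The URL layout (/runs/:runId/category/:category/dataset/:dataset) is inferred from the route file names above and is an assumption, as are the Crumb and StudioLocation types, the buildBreadcrumbs helper, and the "/" home link.

```typescript
interface Crumb {
  label: string;
  href?: string; // omitted for the final, non-clickable segment
}

// Hypothetical route parameters for the Studio drill-down.
interface StudioLocation {
  runId: string;
  category?: string;
  dataset?: string;
  evalId?: string;
}

// Builds Home > Run > Category > Dataset > Eval, stopping at the deepest level provided.
function buildBreadcrumbs(loc: StudioLocation): Crumb[] {
  const crumbs: Crumb[] = [{ label: "Home", href: "/" }];
  let path = `/runs/${encodeURIComponent(loc.runId)}`;
  crumbs.push({ label: loc.runId, href: path });
  if (loc.category === undefined) return crumbs;

  path += `/category/${encodeURIComponent(loc.category)}`;
  crumbs.push({ label: loc.category, href: path });
  if (loc.dataset === undefined) return crumbs;

  path += `/dataset/${encodeURIComponent(loc.dataset)}`;
  crumbs.push({ label: loc.dataset, href: path });
  if (loc.evalId !== undefined) crumbs.push({ label: loc.evalId });
  return crumbs;
}
```

Encoding each segment keeps category and dataset names containing spaces or slashes from breaking the path.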

Acceptance signals

  • category: Fundamentals in eval YAML appears in JSONL output and Studio
  • Eval YAML without category field defaults to "default" category
  • Studio run detail groups datasets under category headers
  • Drill-down: click category → see datasets → click dataset → see evals
  • Breadcrumbs show full path: Home > Run > Category > Dataset > Eval
  • All existing tests pass (no regressions)
  • Old JSONL files without category field render under default category
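
The last signal, backward compatibility with old JSONL files, could be handled when results are hydrated. The sketch below assumes one JSON object per line. The ResultRecord type, both helper names, and the test_id field in the usage example are hypothetical.

```typescript
// A result row with category guaranteed to be present after hydration.
type ResultRecord = Record<string, unknown> & { category: string };

// Assumed default label, matching the parser-side default.
const DEFAULT_CATEGORY = "default";

// Parse one JSONL line and fill in category when it is missing or not a usable string.
function hydrateLine(line: string): ResultRecord {
  const raw = JSON.parse(line) as Record<string, unknown>;
  const value = raw["category"];
  const category = typeof value === "string" && value !== "" ? value : DEFAULT_CATEGORY;
  return { ...raw, category };
}

// Parse a whole JSONL document, skipping blank lines.
function parseJsonl(text: string): ResultRecord[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => hydrateLine(line));
}

// Usage: the second row predates the category field and lands under "default".
const rows = parseJsonl('{"test_id":"a","category":"Fundamentals"}\n{"test_id":"b"}\n');
console.log(rows.map((row) => row.category));
```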

Non-goals

  • Nested categories (single level only)
  • Auto-inferring category from directory structure
  • Changing experiment or target semantics
