Open
Labels: enhancement (New feature or request), wui (Relates to the browser dashboard / web UI runtime)
Description
Objective
Extract and synthesize learnings from eval sessions (decisions with trade-offs, friction points, effective patterns) and surface them in the dashboard. Inspired by melagiri/code-insights, which transforms AI coding sessions into actionable knowledge.
Architecture Boundary
External-first: a dashboard analysis layer plus an optional plugin. It does not modify the core eval engine.
What this enables
Eval runs generate rich data, but the insights are lost between sessions. This feature extracts durable knowledge:
- Decision extraction: What trade-offs did the agent make? What alternatives existed?
- Friction points: Which test categories consistently cause problems? Which tool calls fail most?
- Effective patterns: What prompt structures, tool sequences, or strategies correlate with high scores?
- Pattern export: Convert insights into quality gate rules or CLAUDE.md recommendations
Proposed capabilities
Insight Extraction (per-run)
- Analyze eval traces to extract: decisions made, tools used, error recovery patterns
- Score each insight by impact (how much did it affect the final score?)
- Tag insights by category (retrieval, reasoning, tool use, formatting)
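A minimal sketch of the heuristic option for per-run extraction. It assumes a hypothetical trace shape (a list of test-case results with `test_id`, `category`, `score`, and `tool_calls` fields); none of these names come from agentv. Impact is approximated as the score lost on the test case where the event occurred, which is only one possible proxy for "how much did it affect the final score".

```python
from dataclasses import dataclass


@dataclass
class Insight:
    run_id: str
    test_id: str
    kind: str       # "tool_failure" or "error_recovery"
    category: str   # e.g. "retrieval", "reasoning", "tool_use", "formatting"
    detail: str
    impact: float   # assumed proxy: score lost on the affected test case (0..1)


def extract_insights(run_id: str, cases: list[dict]) -> list[Insight]:
    """Flag failed tool calls, marking those followed by a successful retry of the same tool."""
    insights = []
    for case in cases:
        lost = round(1.0 - case["score"], 3)
        calls = case["tool_calls"]
        for i, call in enumerate(calls):
            if call["ok"]:
                continue
            nxt = calls[i + 1] if i + 1 < len(calls) else None
            recovered = nxt is not None and nxt["tool"] == call["tool"] and nxt["ok"]
            insights.append(Insight(
                run_id=run_id,
                test_id=case["test_id"],
                kind="error_recovery" if recovered else "tool_failure",
                category=case["category"],
                detail=f"{call['tool']} failed at call {i}",
                impact=lost,
            ))
    # Highest-impact insights first, matching the proposed insights feed ordering.
    insights.sort(key=lambda ins: ins.impact, reverse=True)
    return insights
```

An LLM-based extractor could return the same `Insight` records, so the choice between the two approaches would stay hidden from the dashboard.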
Pattern Synthesis (cross-run)
- Aggregate insights across runs within a campaign
- Identify: recurring friction points, consistently effective strategies, degrading patterns
- Synthesis window: configurable (last 5 runs, last 7 days, full campaign)
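The configurable window and a frequency-based notion of "recurring" could be sketched as follows. The run shape (`finished_at` plus a list of `(kind, category)` insight keys) and the `min_runs` threshold are assumptions for illustration, not an existing agentv format.

```python
from collections import Counter
from datetime import datetime, timedelta


def select_window(runs, last_n=None, last_days=None, now=None):
    """Return runs in chronological order, limited by age and/or count.

    With neither limit set, the full campaign is returned.
    """
    ordered = sorted(runs, key=lambda r: r["finished_at"])
    if last_days is not None:
        cutoff = (now or datetime.now()) - timedelta(days=last_days)
        ordered = [r for r in ordered if r["finished_at"] >= cutoff]
    if last_n is not None:
        ordered = ordered[-last_n:] if last_n > 0 else []
    return ordered


def recurring_friction(runs, min_runs=2):
    """Count the number of runs each (kind, category) key appears in.

    A key is counted once per run, so a single noisy run cannot
    make a friction point look recurring.
    """
    seen = Counter()
    for run in runs:
        seen.update(set(run["insights"]))
    return {key: n for key, n in seen.items() if n >= min_runs}
```

For example, `recurring_friction(select_window(runs, last_n=5))` would cover the "last 5 runs" setting, and `select_window(runs, last_days=7)` the "last 7 days" setting.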
Dashboard Views
- Insights feed: Recent insights ordered by impact
- Pattern trends: Which patterns are becoming more/less effective over time
- Friction heatmap: Test categories × failure types, showing persistent problem areas
- Recommendations: Auto-generated suggestions based on pattern analysis
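As an illustration of the data behind the friction heatmap, the sketch below counts failures per test category and failure type and renders the grid as text rows. The input shape (an iterable of `(test_category, failure_type)` pairs) and the failure-type names in the usage note are hypothetical.

```python
from collections import defaultdict


def friction_heatmap(failures):
    """Build a nested dict: test category -> failure type -> count."""
    grid = defaultdict(lambda: defaultdict(int))
    for category, failure_type in failures:
        grid[category][failure_type] += 1
    return {category: dict(row) for category, row in grid.items()}


def render_rows(grid):
    """Render the grid as pipe-separated rows, filling missing cells with 0."""
    cols = sorted({ft for row in grid.values() for ft in row})
    lines = ["category | " + " | ".join(cols)]
    for category in sorted(grid):
        cells = (str(grid[category].get(col, 0)) for col in cols)
        lines.append(category + " | " + " | ".join(cells))
    return lines
```

Feeding it `[("retrieval", "timeout"), ("retrieval", "timeout"), ("formatting", "schema_error")]` yields a grid where the `retrieval` × `timeout` cell is 2. A dashboard view would map those counts to color intensity.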
Export
- Export effective patterns as quality gate rules (extends #334, "feat(eval): composable quality gates with auto-remediation triggers")
- Export recommendations as CLAUDE.md / .cursorrules entries (like code-insights' rule generation)
- Export friction report as markdown for team review
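Two of the export targets above could look roughly like this. The input shapes, the `min_lift` threshold, and the wording of the generated entries are all assumptions. The quality gate rule format from #334 is not specified here, so it is left out.

```python
def export_friction_report(friction, title="Friction report"):
    """Render {(kind, category): runs_affected} as a markdown table, worst first."""
    lines = [
        f"# {title}",
        "",
        "| Category | Kind | Runs affected |",
        "| --- | --- | --- |",
    ]
    ranked = sorted(friction.items(), key=lambda kv: (-kv[1], kv[0]))
    for (kind, category), n in ranked:
        lines.append(f"| {category} | {kind} | {n} |")
    return "\n".join(lines) + "\n"


def export_rule_entries(patterns, min_lift=0.05):
    """Turn (description, score_lift) pairs into bullet entries for a CLAUDE.md-style file.

    Patterns below min_lift are dropped so weak signals do not become rules.
    """
    keep = sorted((p for p in patterns if p[1] >= min_lift), key=lambda p: p[1], reverse=True)
    return [f"- {desc} (observed score lift: {lift:+.2f})" for desc, lift in keep]
```

Keeping the exporters as pure functions over already-synthesized data would let the dashboard and an optional plugin share them without touching the core eval engine.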
Design Latitude
- Whether insight extraction uses LLM analysis or heuristic pattern matching
- Synthesis algorithm (frequency-based, score-correlation, or LLM-summarized)
- Storage format for extracted insights
- How deep the code-insights integration goes (import their data vs. re-implement their approach)
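To make the score-correlation option for the synthesis algorithm concrete, here is a deliberately simple stand-in: the difference in mean score between runs that exhibit a pattern and runs that do not. The run shape (`score`, `patterns`) and the pattern name `plan_first` are hypothetical, and a real implementation would likely need a proper correlation measure and a minimum sample size.

```python
from statistics import mean


def score_lift(runs, pattern):
    """Mean score of runs with the pattern minus mean score of runs without it.

    Returns None when either group is empty, since no comparison is possible.
    """
    with_pattern = [r["score"] for r in runs if pattern in r["patterns"]]
    without = [r["score"] for r in runs if pattern not in r["patterns"]]
    if not with_pattern or not without:
        return None
    return mean(with_pattern) - mean(without)


runs = [
    {"score": 0.9, "patterns": {"plan_first"}},
    {"score": 0.7, "patterns": {"plan_first"}},
    {"score": 0.5, "patterns": set()},
]
lift = score_lift(runs, "plan_first")  # roughly 0.3 for this toy data
```

A positive lift suggests the pattern co-occurs with higher scores. It does not establish that the pattern caused them.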
Acceptance Signals
- Per-run insights extracted from eval traces
- Cross-run pattern synthesis identifies friction points and effective patterns
- Dashboard displays insights feed with impact scores
- Pattern trends visible over time
- At least one export format (quality gate rules or CLAUDE.md recommendations)
Non-Goals
- Real-time session monitoring (code-insights' primary use case)
- AI fluency scoring (a code-insights feature that doesn't map to the eval framework)
- Multi-tool analysis (focus on agentv eval data, not external tool sessions)
- Replacing code-insights (complementary, not competitive)
Dependencies
- feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration #563 (dashboard platform) — the UI platform this lives in
- feat(eval): composable quality gates with auto-remediation triggers #334 (quality gates) — export target for generated rules
- feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection #335 (iteration tracking) — provides cross-run data for synthesis
Research source
- melagiri/code-insights — session analysis, pattern detection, weekly synthesis, rule generation, AI fluency scoring
- code-insights architecture: Vite + React SPA, Hono API server, SQLite local-first storage, Ollama for free LLM analysis
Project status: Backlog