
Large PR stress test fixtures: progressive compression validation #66

@haasonsaas

Description


Problem

Issue #30 (Adaptive patch compression for large PRs) is open but has no eval fixtures to validate compression behavior. Current fixtures are all small, focused diffs. Real-world PRs can be 50+ files and thousands of lines — the reviewer needs to degrade gracefully, not silently drop findings.

Proposal

Create stress test fixtures that validate review quality under compression:

Fixtures

| # | name | size | expected behavior |
|---|------|------|-------------------|
| 1 | `large-pr-50-files-mixed` | 50 files, ~3000 lines | Must still catch the 1 security issue buried in file #38 |
| 2 | `large-pr-refactor-plus-bug` | 30 files (28 renames + 2 real changes) | Must not waste context on renames; must review the 2 substantive files |
| 3 | `large-pr-generated-code` | 10 files, but 1 is a 2000-line generated proto | Must skip the generated file, review the rest |
| 4 | `large-pr-deletion-heavy` | 20 files, 15 are pure deletions | Must review the 5 non-deletion files; deletion-only files may be skipped |
| 5 | `context-budget-exceeded` | Single file, 5000-line diff | Must use chunking/compression, not truncate randomly |
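For fixture 1, a minimal generator sketch is below. Everything here is illustrative (the file layout, the `pkg/module_*.py` paths, and the planted `os.system` finding are assumptions, not the project's real fixture format); the point is just that the fixture is a unified diff with one known finding buried deep in the file list.

```python
# Hypothetical generator for the `large-pr-50-files-mixed` fixture:
# emits a unified diff of 50 added files with one planted security
# issue (an unsanitized os.system call) buried in file #38.

def make_hunk(path: str, lines: list[str]) -> str:
    # Build a minimal unified-diff hunk for a newly added file.
    header = f"--- /dev/null\n+++ b/{path}\n@@ -0,0 +1,{len(lines)} @@\n"
    return header + "".join(f"+{line}\n" for line in lines)

def build_fixture(n_files: int = 50, buggy_index: int = 38) -> str:
    hunks = []
    for i in range(1, n_files + 1):
        body = [f"def helper_{i}(x):", f"    return x + {i}"]
        if i == buggy_index:
            # The planted finding that compression must not drop.
            body = [
                "import os",
                "def run(cmd):",
                "    os.system(cmd)  # shell injection: unsanitized input",
            ]
        hunks.append(make_hunk(f"pkg/module_{i:02d}.py", body))
    return "".join(hunks)

diff = build_fixture()
```

A check like `"os.system" in diff` then doubles as the recall oracle: the reviewer's output on the full fixture must flag the same line.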

Metrics

For each fixture, track:

  • Files reviewed vs files skipped (and why)
  • Compression strategy used (full / compressed / clipped / multi-call)
  • Finding recall compared to a "small diff" version of the same bug
  • Total tokens used vs context budget
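The four metrics above could be captured in one per-fixture record; the sketch below uses hypothetical field names (there is no such class in the repo yet), with recall defined against the expected findings from the small-diff version of the same bug.

```python
from dataclasses import dataclass

# Hypothetical per-fixture metrics record; field names are
# illustrative, not an existing API in this repo.
@dataclass
class CompressionMetrics:
    fixture: str
    files_reviewed: int
    files_skipped: dict[str, int]  # skip reason -> count
    strategy: str                  # "full" | "compressed" | "clipped" | "multi-call"
    findings_recalled: int         # findings also caught on the small-diff version
    findings_expected: int
    tokens_used: int
    token_budget: int

    @property
    def recall(self) -> float:
        return self.findings_recalled / max(self.findings_expected, 1)

    @property
    def budget_utilization(self) -> float:
        return self.tokens_used / self.token_budget

m = CompressionMetrics(
    fixture="large-pr-50-files-mixed",
    files_reviewed=47,
    files_skipped={"generated": 2, "deletion-only": 1},
    strategy="compressed",
    findings_recalled=1,
    findings_expected=1,
    tokens_used=90_000,
    token_budget=120_000,
)
```

Logging one such record per fixture run also satisfies the "compression strategy logged per fixture" acceptance item.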

Acceptance

  • 5 large-PR fixtures in `eval/fixtures/stress/`
  • Compression strategy logged per fixture
  • Security findings not dropped under compression
  • Triage correctly skips generated/deletion-only files
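The triage behavior in the last item could look roughly like the sketch below. The marker patterns and the deletion-only rule are assumptions for illustration, not the reviewer's actual heuristics.

```python
# Hypothetical triage heuristics: skip generated and deletion-only
# files before spending context budget on them.

GENERATED_PATH_MARKERS = ("_pb2.py", ".pb.go", "generated")

def is_generated(path: str, head: str) -> bool:
    # Flag by path convention or by a "DO NOT EDIT" banner in the
    # first lines of the file.
    return any(m in path for m in GENERATED_PATH_MARKERS) or "DO NOT EDIT" in head

def is_deletion_only(added: int, removed: int) -> bool:
    return added == 0 and removed > 0

def should_review(path: str, head: str, added: int, removed: int) -> bool:
    return not (is_generated(path, head) or is_deletion_only(added, removed))
```

Fixtures 3 and 4 would then assert `should_review(...)` is false for the generated proto and the 15 pure deletions, and true for everything else.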

🤖 Generated with Claude Code

Metadata


    Labels

    `area: review-pipeline` (Review pipeline, context, prompts), `eval-fixture` (Eval fixture / benchmark scenario)
