File triage: classify files before expensive review #29

@haasonsaas

Problem

DiffScope reviews every changed file with the same model and at the same depth. CodeRabbit, by contrast, first uses a cheap model to classify each file as NEEDS_REVIEW or APPROVED (cosmetic/formatting change only) and spends expensive model tokens only on the NEEDS_REVIEW files. This reduces cost and noise significantly.

How CodeRabbit Does It

  1. A lightweight model classifies each changed file:
    • "Does this file contain logic/functionality changes, or is it purely cosmetic/formatting?"
    • Classification: NEEDS_REVIEW or APPROVED
  2. Files classified as APPROVED skip the detailed review entirely
  3. Rate limits: Free=150 files max, Pro=300 files max
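
A minimal sketch of the classification step above, assuming the cheap model is asked to answer with a single word. The prompt wording beyond the quoted question, and the names `Verdict`, `build_triage_prompt`, and `parse_verdict`, are hypothetical and not taken from CodeRabbit or DiffScope. The one deliberate choice is that any unclear reply falls back to NEEDS_REVIEW, so a confused classifier can never cause a file to be skipped.

```rust
/// Outcome of the cheap first-pass classification (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    NeedsReview,
    Approved,
}

/// Builds the classification prompt for one file.
/// The exact wording is an assumption made for illustration.
fn build_triage_prompt(path: &str, patch: &str) -> String {
    format!(
        "File: {path}\n\nDiff:\n{patch}\n\n\
         Does this file contain logic/functionality changes, or is it purely \
         cosmetic/formatting? Answer with exactly one word: NEEDS_REVIEW or APPROVED."
    )
}

/// Maps the model's reply to a verdict.
/// Only a reply that is exactly APPROVED (ignoring case, surrounding
/// whitespace, and punctuation) skips review; everything else is reviewed.
fn parse_verdict(reply: &str) -> Verdict {
    let normalized = reply
        .trim()
        .trim_matches(|c: char| !c.is_ascii_alphabetic() && c != '_')
        .to_ascii_uppercase();
    if normalized == "APPROVED" {
        Verdict::Approved
    } else {
        Verdict::NeedsReview
    }
}
```
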

How Qodo Does It

  • Files sorted by main language first, then by token count descending
  • patch_extension_skip_types = [".md", ".txt"] — certain file types auto-skipped
  • Deletion-only hunks removed via omit_deletion_hunks() before the expensive call
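
To make the three Qodo behaviors concrete, here is a rough Rust sketch of what an equivalent pre-filter could look like. It is not Qodo's code: `ChangedFile`, `order_for_review`, and `is_deletion_only_hunk` are hypothetical names, and the sketch assumes each file already carries a detected language and a token count.

```rust
use std::cmp::Reverse;

/// Hypothetical per-file record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ChangedFile {
    path: String,
    language: String,
    tokens: usize,
}

/// Extensions skipped outright, mirroring the `patch_extension_skip_types` example.
const SKIP_EXTENSIONS: [&str; 2] = [".md", ".txt"];

/// Drops files with a skipped extension, then orders the rest:
/// files in the main language first, then by token count descending.
fn order_for_review(mut files: Vec<ChangedFile>, main_language: &str) -> Vec<ChangedFile> {
    files.retain(|f| !SKIP_EXTENSIONS.iter().any(|ext| f.path.ends_with(*ext)));
    // `false` sorts before `true`, so main-language files come first.
    files.sort_by_key(|f| (f.language != main_language, Reverse(f.tokens)));
    files
}

/// Returns true when a hunk removes lines and adds none, so it can be
/// dropped before the expensive call. Expects hunk lines only
/// (the `@@` header plus ` `, `+`, `-` lines), not file headers.
fn is_deletion_only_hunk(hunk_lines: &[&str]) -> bool {
    let mut saw_removal = false;
    for line in hunk_lines {
        if line.starts_with('+') {
            return false;
        }
        if line.starts_with('-') {
            saw_removal = true;
        }
    }
    saw_removal
}
```
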

Proposed Solution

Add a triage step before the main review:

Implementation

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TriageResult {
    NeedsReview,        // Logic/functionality change: full review
    Cosmetic,           // Formatting, whitespace, comments-only: skip
    ConfigChange,       // Config/env changes: lightweight review
    TestOnly,           // Test changes: review with different rules
    DeletionOnly,       // File deleted or only lines removed: skip
    Generated,          // Auto-generated code, lock files, vendor dirs: skip
}

async fn triage_file(
    diff: &UnifiedDiff,
    model: &ModelConfig,  // use weak/cheap model
) -> TriageResult {
    // Heuristic checks first (no LLM needed):
    // - All-whitespace changes → Cosmetic
    // - Deletion-only hunks → DeletionOnly
    // - Known generated file patterns → Generated
    // - Lock files, vendor dirs → Generated

    // LLM classification for ambiguous cases:
    // - "Is this a logic change or purely cosmetic?"
    // - Use cheap model (Haiku, GPT-4o-mini)

    // Body not written yet; todo!() keeps the sketch compiling.
    todo!("heuristics first, then cheap-model classification")
}
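
The comments on `TriageResult` imply a mapping from each triage outcome to a review depth. One way that dispatch could look is sketched below. `ReviewPlan`, `plan_for`, and `reviewed_count` are hypothetical names introduced for illustration, and the enum is repeated only so the example is self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TriageResult {
    NeedsReview,
    Cosmetic,
    ConfigChange,
    TestOnly,
    DeletionOnly,
    Generated,
}

/// Hypothetical review depths, one per behavior named in the enum comments.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReviewPlan {
    Full,        // main model, full rule set
    Lightweight, // reduced review for config/env changes
    TestRules,   // review with test-specific rules
    Skip,        // no model call at all
}

/// Maps a triage outcome to the review it should receive.
fn plan_for(result: TriageResult) -> ReviewPlan {
    match result {
        TriageResult::NeedsReview => ReviewPlan::Full,
        TriageResult::ConfigChange => ReviewPlan::Lightweight,
        TriageResult::TestOnly => ReviewPlan::TestRules,
        TriageResult::Cosmetic | TriageResult::DeletionOnly | TriageResult::Generated => {
            ReviewPlan::Skip
        }
    }
}

/// Counts how many files still reach some form of review after triage.
fn reviewed_count(results: &[TriageResult]) -> usize {
    results
        .iter()
        .filter(|r| plan_for(**r) != ReviewPlan::Skip)
        .count()
}
```

Matching exhaustively, without a `_` arm, means that adding a new `TriageResult` variant later fails to compile until a review plan is chosen for it.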

Heuristic-Only Triage (no LLM cost)

Many files can be triaged without any LLM call:

  • Lock files (Cargo.lock, package-lock.json, yarn.lock)
  • Generated code (.generated., _generated/)
  • Binary files
  • Deletion-only changes
  • Whitespace-only changes
  • Comment-only changes (parse for //, #, /* */ patterns)
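
A rough, std-only sketch of several of these heuristics follows. Everything in it is an assumption for illustration: the `Heuristic` enum, the function names, the extension-to-comment-prefix table, and the input shape (hunk body lines only, without `+++`/`---` file headers). It is deliberately conservative: block comments and unknown file types fall through to `Ambiguous` instead of being skipped. Binary-file detection is omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Heuristic {
    Generated,
    DeletionOnly,
    WhitespaceOnly,
    CommentOnly,
    Ambiguous, // no heuristic applied; hand the file to the cheap model
}

/// Path-based check for lock files and generated code.
fn is_generated_path(path: &str) -> bool {
    const LOCK_FILES: [&str; 3] = ["Cargo.lock", "package-lock.json", "yarn.lock"];
    let file_name = path.rsplit('/').next().unwrap_or(path);
    LOCK_FILES.iter().any(|name| *name == file_name)
        || file_name.contains(".generated.")
        || path.split('/').any(|part| part == "_generated")
}

/// Line-comment prefixes by file extension (a small illustrative table).
fn comment_prefixes(path: &str) -> &'static [&'static str] {
    match path.rsplit('.').next().unwrap_or("") {
        "rs" | "go" | "c" | "h" | "cpp" | "java" | "js" | "ts" => &["//"],
        "py" | "rb" | "sh" | "yaml" | "yml" | "toml" => &["#"],
        _ => &[],
    }
}

fn is_comment_or_blank(line: &str, prefixes: &[&str]) -> bool {
    let trimmed = line.trim();
    trimmed.is_empty() || prefixes.iter().any(|p| trimmed.starts_with(*p))
}

/// Removes all whitespace so re-indented or re-wrapped code compares equal.
fn squash(line: &str) -> String {
    line.chars().filter(|c| !c.is_whitespace()).collect()
}

fn squash_all(lines: &[&str]) -> String {
    lines.iter().map(|l| squash(l)).collect()
}

/// Classifies one file from its path and hunk body lines.
fn classify(path: &str, hunk_lines: &[&str]) -> Heuristic {
    if is_generated_path(path) {
        return Heuristic::Generated;
    }
    let added: Vec<&str> = hunk_lines.iter().filter_map(|l| l.strip_prefix('+')).collect();
    let removed: Vec<&str> = hunk_lines.iter().filter_map(|l| l.strip_prefix('-')).collect();
    if added.is_empty() && removed.is_empty() {
        return Heuristic::Ambiguous; // e.g. rename or mode change
    }
    if added.is_empty() {
        return Heuristic::DeletionOnly;
    }
    // Caveat: whitespace can be significant (Python indentation, string
    // literals), so a real implementation may want to exempt such files.
    if squash_all(&added) == squash_all(&removed) {
        return Heuristic::WhitespaceOnly;
    }
    let prefixes = comment_prefixes(path);
    if !prefixes.is_empty()
        && added.iter().chain(removed.iter()).all(|l| is_comment_or_blank(l, prefixes))
    {
        return Heuristic::CommentOnly;
    }
    Heuristic::Ambiguous
}
```

Choosing comment prefixes per extension avoids treating lines such as `#[derive(Debug)]` in Rust as comments, which a blanket `#` check would do.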

Configuration

triage:
  enabled: true
  model: null  # null = use weak model, or specify explicitly
  skip_patterns:
    - "*.lock"
    - "*.generated.*"
    - "vendor/**"
  auto_approve:
    - deletion_only: true
    - whitespace_only: true
    - comment_only: true
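
The `skip_patterns` entries need a matching rule. The sketch below shows one possible interpretation using only the standard library; a real implementation would more likely use an existing glob crate. The semantics are assumptions, loosely modeled on gitignore: `*` matches within one path segment, `**` may cross `/`, and a pattern without `/` is matched against the file name only. `glob_match` and `is_skipped` are hypothetical names.

```rust
/// Minimal glob matcher: `*` matches any run of bytes except `/`,
/// `**` matches any run of bytes including `/`, everything else is literal.
fn glob_match(pattern: &[u8], text: &[u8]) -> bool {
    match pattern.split_first() {
        None => text.is_empty(),
        Some((&b'*', rest)) => {
            // A second `*` upgrades the wildcard so it may cross `/`.
            let (crosses_slash, rest) = match rest.split_first() {
                Some((&b'*', after)) => (true, after),
                _ => (false, rest),
            };
            // Try every length the wildcard could consume.
            for i in 0..=text.len() {
                if glob_match(rest, &text[i..]) {
                    return true;
                }
                // A single `*` cannot grow past a path separator.
                if i < text.len() && text[i] == b'/' && !crosses_slash {
                    return false;
                }
            }
            false
        }
        Some((c, rest)) => match text.split_first() {
            Some((t, text_rest)) => c == t && glob_match(rest, text_rest),
            None => false,
        },
    }
}

/// Returns true when `path` matches any configured skip pattern.
/// Patterns without `/` are compared against the file name only.
fn is_skipped(path: &str, skip_patterns: &[&str]) -> bool {
    let file_name = path.rsplit('/').next().unwrap_or(path);
    skip_patterns.iter().any(|pattern| {
        let target = if pattern.contains('/') { path } else { file_name };
        glob_match(pattern.as_bytes(), target.as_bytes())
    })
}
```

With the patterns from the configuration above, `Cargo.lock`, `web/yarn.lock`, `src/api.generated.ts`, and anything under `vendor/` would be skipped, while `src/main.rs` would not.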

Expected Impact

Priority

Medium — cost and noise reduction. Quick win that compounds with scale.

Metadata

Labels: enhancement (New feature or request)