Embedding-based false positive filtering from developer feedback #27

@haasonsaas

Description

Problem

DiffScope has a convention learner with Wilson score confidence intervals — statistically more rigorous than competitors. But Greptile's embedding-based approach to false positive filtering is empirically the most effective technique published in the space, taking their comment address rate from 19% to 55%+.

How Greptile Does It (from their "Make LLMs Shut Up" blog)

What failed:

  • Prompt engineering / few-shot: Model "inferred superficial characteristics" rather than learning meaningful patterns. Backfired.
  • LLM-as-judge: A secondary LLM rating comments 1-10 was "nearly random in its judgment of its own output".

What works:

  1. Store embeddings of all past review comments, tagged with developer 👍/👎 feedback
  2. For each new comment the LLM wants to post:
    • Compute cosine similarity against the feedback database
    • Block if similar to 3+ distinct downvoted comments
    • Pass if similar to 3+ upvoted comments
    • Pass ambiguous cases (not enough signal)
  3. Result: address rate went from 19% to 55%+

Key insight: "Nits are subjective — definitions and standards vary from team to team." This must be learned per-team, not universally.
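The similarity check in step 2 reduces to a plain cosine-similarity computation over embedding vectors. A minimal sketch (hypothetical helper, not from Greptile's post; in practice the vector DB does this server-side):

```rust
/// Cosine similarity between two embedding vectors.
/// Returns a value in [-1.0, 1.0]; 1.0 means identical direction.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // degenerate embedding: treat as no signal
    }
    dot / (norm_a * norm_b)
}

fn main() {
    let a = [1.0, 0.0];
    let b = [1.0, 0.0];
    let c = [0.0, 1.0];
    println!("{:.2}", cosine_similarity(&a, &b)); // 1.00
    println!("{:.2}", cosine_similarity(&a, &c)); // 0.00
}
```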

Proposed Solution

Enhance the existing feedback system (FeedbackStore) with embedding-based similarity:

Data Model

CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector, required for the vector type

CREATE TABLE review_feedback (
    id SERIAL PRIMARY KEY,
    repo TEXT NOT NULL,
    comment_text TEXT NOT NULL,
    comment_embedding vector(1536),
    category TEXT,  -- logic, style, security, etc.
    file_pattern TEXT,  -- e.g., "*.rs", "src/api/**"
    feedback TEXT NOT NULL,  -- 'accepted' or 'rejected'
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX ON review_feedback USING ivfflat (comment_embedding vector_cosine_ops);
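A `find_similar` lookup against this table could use pgvector's cosine-distance operator `<=>` (a sketch; the `$n` parameters and the `LIMIT` are placeholders):

```sql
-- Nearest past comments within the similarity cutoff for this repo.
-- <=> returns cosine *distance*, so similarity = 1 - distance.
SELECT id, feedback, 1 - (comment_embedding <=> $1) AS similarity
FROM review_feedback
WHERE repo = $2
  AND 1 - (comment_embedding <=> $1) >= $3  -- similarity_cutoff, e.g. 0.85
ORDER BY comment_embedding <=> $1
LIMIT 20;
```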

Filtering Logic

async fn should_post_comment(
    comment: &Comment,
    feedback_db: &FeedbackDb,
    threshold: usize,        // default 3
    similarity_cutoff: f32,  // default 0.85
) -> anyhow::Result<bool> {
    let embedding = embed(comment.text()).await?;
    let similar = feedback_db.find_similar(embedding, similarity_cutoff).await?;

    let rejected = similar.iter().filter(|f| f.feedback == "rejected").count();
    let accepted = similar.iter().filter(|f| f.feedback == "accepted").count();

    if rejected >= threshold {
        return Ok(false); // block: matches threshold+ rejected comments
    }
    if accepted >= threshold {
        return Ok(true); // pass: matches threshold+ accepted comments
    }
    Ok(true) // ambiguous → pass (err on the side of posting)
}

Feedback Collection

  • diffscope feedback accept <comment-id> — existing CLI, add embedding storage
  • diffscope feedback reject <comment-id> — existing CLI, add embedding storage
  • GitHub reactions (👍/👎) on posted PR comments → auto-collect via webhook
  • Resolved/unresolved thread status → signal for accepted/rejected
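For the GitHub reaction path, the mapping from a reaction's `content` field to a feedback label is straightforward; `"+1"` and `"-1"` are the values GitHub's reactions API uses for 👍/👎 (sketch; function name is hypothetical):

```rust
/// Map a GitHub reaction's `content` field to a feedback label.
/// "+1" / "-1" are GitHub's API values for the 👍 / 👎 reactions.
fn feedback_from_reaction(content: &str) -> Option<&'static str> {
    match content {
        "+1" => Some("accepted"),
        "-1" => Some("rejected"),
        _ => None, // laugh, heart, etc.: no clear accept/reject signal
    }
}

fn main() {
    println!("{:?}", feedback_from_reaction("+1"));    // Some("accepted")
    println!("{:?}", feedback_from_reaction("-1"));    // Some("rejected")
    println!("{:?}", feedback_from_reaction("heart")); // None
}
```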

Relationship to Existing Convention Learner

  • The Wilson score convention learner operates on exact pattern matches (rule_id, file pattern, category)
  • Embedding-based filtering operates on semantic similarity of the comment text
  • Both should run: Wilson score for structured rules, embeddings for fuzzy/subjective nits
  • The embedding filter runs first (cheap vector lookup), Wilson score augments
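For context, the Wilson score side of this pairing is the standard lower confidence bound on a binomial proportion, which penalizes small samples. A sketch of the formula (DiffScope's actual convention-learner API may differ):

```rust
/// Wilson score lower bound for a binomial proportion at z = 1.96 (~95%).
/// A rule accepted 9/10 times scores lower than one accepted 90/100 times,
/// which is the small-sample correction the convention learner relies on.
fn wilson_lower_bound(accepted: u32, total: u32) -> f64 {
    if total == 0 {
        return 0.0; // no observations: no confidence
    }
    let z = 1.96_f64;
    let n = total as f64;
    let p = accepted as f64 / n;
    let z2 = z * z;
    let center = p + z2 / (2.0 * n);
    let margin = z * ((p * (1.0 - p) / n) + z2 / (4.0 * n * n)).sqrt();
    (center - margin) / (1.0 + z2 / n)
}

fn main() {
    // More evidence at the same accept rate → higher lower bound.
    println!("{:.3}", wilson_lower_bound(9, 10));
    println!("{:.3}", wilson_lower_bound(90, 100));
}
```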

Expected Impact

Greptile's published numbers: 19% → 55%+ address rate. Even half that improvement would be significant for DiffScope's signal-to-noise ratio.

Priority

High — direct attack on the #1 churn driver (review fatigue from noisy comments).

Labels: enhancement (New feature or request)