## Problem
DiffScope has a convention learner built on Wilson score confidence intervals, which is statistically more rigorous than competing approaches. But Greptile's embedding-based approach to false-positive filtering is empirically the most effective technique published in this space, taking their comment address rate from 19% to 55%+.
## How Greptile Does It (from their "Make LLMs Shut Up" blog)
**What failed:**
- **Prompt engineering / few-shot:** the model "inferred superficial characteristics" rather than learning meaningful patterns. Backfired.
- **LLM-as-judge:** a secondary LLM rating comments 1-10 was "nearly random in its judgment of its own output".
**What works:**
- Store embeddings of all past review comments, tagged with developer 👍/👎 feedback
- For each new comment the LLM wants to post:
  - Compute cosine similarity against the feedback database
  - Block if similar to 3+ distinct downvoted comments
  - Pass if similar to 3+ upvoted comments
  - Pass ambiguous cases (not enough signal)
- Result: address rate went from 19% to 55%+
**Key insight:** "Nits are subjective — definitions and standards vary from team to team." This must be learned per-team, not universally.
## Proposed Solution

Enhance the existing feedback system (`FeedbackStore`) with embedding-based similarity:

### Data Model
```sql
CREATE TABLE review_feedback (
    id SERIAL PRIMARY KEY,
    repo TEXT NOT NULL,
    comment_text TEXT NOT NULL,
    comment_embedding vector(1536),
    category TEXT,            -- logic, style, security, etc.
    file_pattern TEXT,        -- e.g., "*.rs", "src/api/**"
    feedback TEXT NOT NULL,   -- 'accepted' or 'rejected'
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON review_feedback USING ivfflat (comment_embedding vector_cosine_ops);
```

### Filtering Logic
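The `find_similar` lookup can be a single pgvector query against the schema above. A sketch, with illustrative `$1`/`$2` placeholders, cutoff, and `LIMIT` (not final values):

```sql
-- Nearest past comments for this repo, filtered to a cosine-similarity cutoff.
-- pgvector's <=> operator returns cosine distance, so similarity = 1 - distance.
SELECT feedback, comment_text
FROM review_feedback
WHERE repo = $1
  AND 1 - (comment_embedding <=> $2) >= 0.85
ORDER BY comment_embedding <=> $2
LIMIT 20;
```

Note: ivfflat accelerates the `ORDER BY ... LIMIT` pattern; depending on how the planner handles the extra `WHERE` clauses, fetching top-k by distance and applying the similarity cutoff client-side is a reasonable alternative.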
```rust
async fn should_post_comment(
    comment: &Comment,
    feedback_db: &FeedbackDb,
    threshold: usize,       // default 3
    similarity_cutoff: f32, // default 0.85
) -> anyhow::Result<bool> {
    // Result return type so `?` can propagate embedding/DB errors
    let embedding = embed(comment.text()).await?;
    let similar = feedback_db.find_similar(embedding, similarity_cutoff).await?;
    let rejected = similar.iter().filter(|f| f.feedback == "rejected").count();
    let accepted = similar.iter().filter(|f| f.feedback == "accepted").count();
    if rejected >= threshold {
        return Ok(false); // block: enough negative signal
    }
    if accepted >= threshold {
        return Ok(true); // pass: enough positive signal
    }
    Ok(true) // ambiguous → pass (err on the side of posting)
}
```

### Feedback Collection
- `diffscope feedback accept <comment-id>` — existing CLI, add embedding storage
- `diffscope feedback reject <comment-id>` — existing CLI, add embedding storage
- GitHub reactions (👍/👎) on posted PR comments → auto-collect via webhook
- Resolved/unresolved thread status → signal for accepted/rejected
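All four collection channels above reduce to the two labels stored in `review_feedback.feedback`. A minimal sketch of that mapping (the enum and function names are hypothetical, not existing DiffScope APIs):

```rust
/// Hypothetical feedback signals from the collection channels listed above.
enum FeedbackSignal {
    CliAccept,        // diffscope feedback accept
    CliReject,        // diffscope feedback reject
    ThumbsUp,         // 👍 reaction on a posted PR comment
    ThumbsDown,       // 👎 reaction on a posted PR comment
    ThreadResolved,   // review thread marked resolved
    ThreadUnresolved, // review thread left unresolved
}

/// Collapse a signal into the label stored in review_feedback.feedback.
fn to_label(signal: FeedbackSignal) -> &'static str {
    match signal {
        FeedbackSignal::CliAccept
        | FeedbackSignal::ThumbsUp
        | FeedbackSignal::ThreadResolved => "accepted",
        FeedbackSignal::CliReject
        | FeedbackSignal::ThumbsDown
        | FeedbackSignal::ThreadUnresolved => "rejected",
    }
}
```

Keeping a single choke point like this makes it easy to add new signal sources later without touching the storage layer.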
## Relationship to Existing Convention Learner

- The Wilson score convention learner operates on exact pattern matches (`rule_id`, file pattern, category)
- Embedding-based filtering operates on semantic similarity of the comment text
- Both should run: Wilson score for structured rules, embeddings for fuzzy/subjective nits
- The embedding filter runs first (a cheap vector lookup); the Wilson score learner then augments the decision for structured rules
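For reference, the Wilson score lower bound the convention learner relies on can be sketched as follows (function name and z = 1.96, i.e. 95% confidence, are assumptions; DiffScope's actual implementation may differ):

```rust
/// Lower bound of the Wilson score interval for a proportion.
/// `positive` = accepted-feedback count, `total` = total observations.
/// Unlike a raw ratio, the bound stays low when evidence is scarce.
fn wilson_lower_bound(positive: u32, total: u32) -> f64 {
    if total == 0 {
        return 0.0; // no evidence yet
    }
    let n = total as f64;
    let p = positive as f64 / n; // observed acceptance rate
    let z = 1.96_f64; // 95% confidence
    let denom = 1.0 + z * z / n;
    let center = p + z * z / (2.0 * n);
    let margin = z * (p * (1.0 - p) / n + z * z / (4.0 * n * n)).sqrt();
    ((center - margin) / denom).max(0.0)
}
```

For example, 9/10 acceptances yields a lower bound near 0.6, while 90/100 pushes it above 0.8: more evidence tightens the interval, which is exactly why it beats a raw ratio for ranking learned conventions.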
## Expected Impact
Greptile's published numbers: 19% → 55%+ address rate. Even half that improvement would be significant for DiffScope's signal-to-noise ratio.
## Priority
High — direct attack on the #1 churn driver (review fatigue from noisy comments).