Problem
Greptile's 82% bug catch rate correlates directly with its embedding-based retrieval pipeline. DiffScope has a symbol index and symbol graph but no embedding-based semantic search, so cross-file context is limited to what the symbol graph can structurally resolve; semantically related code with no direct structural reference is never surfaced.
How Competitors Do It
Greptile (best-in-class)
Two-phase indexing:
- Chunker — breaks code into function-level chunks (not file-level). Their experiments show:
- File-level chunking: 0.718 cosine similarity
- Function-level chunking: 0.768 cosine similarity
- Adding surrounding "noise" code degrades retrieval dramatically
- Summarizer — translates each code chunk to a natural language description before embedding:
- Raw code-to-query similarity: 0.728
- NL-translated code-to-query similarity: 0.815
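For reference, the metric behind all the numbers above is plain cosine similarity between embedding vectors. A minimal sketch (the function name is illustrative, not from any DiffScope module):

```rust
/// Cosine similarity between two embedding vectors — the retrieval-quality
/// metric quoted in the chunking and summarization experiments above.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // avoid division by zero for degenerate vectors
    }
    dot / (norm_a * norm_b)
}
```

Identical vectors score 1.0, orthogonal ones 0.0, which is why the 0.718 → 0.768 and 0.728 → 0.815 deltas are meaningful improvements on the same scale.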
- Storage: pgvector in PostgreSQL (not a dedicated vector DB)
- Embedding model: text-embedding-3-small (OpenAI, 1536 dimensions)
CodeRabbit
- LanceDB (serverless vector DB on S3) for semantic code search
- Combined with ast-grep for structural patterns
Qodo Merge (Pro)
- Proprietary Qodo Embed-1 embedding model (top CoIR benchmark performer)
- Language-specific static analysis for chunking (not naive line splitting)
- Re-ranking step after initial semantic search
Proposed Solution
Phase 1: Indexing Pipeline
- Function-level chunker — DiffScope already has function_chunker.rs. Extend it to:
  - Extract function/method/class bodies as individual chunks
  - Preserve metadata (file path, line range, function name, enclosing class)
  - Support all languages the symbol index already covers
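A sketch of the per-chunk metadata the extended chunker would emit. These type and field names are illustrative; the real definitions would live in (or alongside) function_chunker.rs:

```rust
/// Kind of code unit a chunk represents (mirrors the proposed chunk_type column).
#[derive(Debug, Clone, PartialEq)]
enum ChunkType {
    Function,
    Method,
    Class,
    Module,
}

/// One function-level chunk plus the metadata needed for storage and retrieval.
#[derive(Debug, Clone)]
struct CodeChunk {
    file_path: String,
    start_line: u32,
    end_line: u32,
    symbol_name: String,
    enclosing_class: Option<String>,
    chunk_type: ChunkType,
    raw_code: String,
}

impl CodeChunk {
    /// Stable identifier, useful as a cache key for summaries and embeddings.
    fn id(&self) -> String {
        format!(
            "{}:{}-{}:{}",
            self.file_path, self.start_line, self.end_line, self.symbol_name
        )
    }
}
```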
- NL translation — For each chunk, use a cheap/fast model (GPT-4o-mini or Haiku) to generate a natural language summary, e.g.:
  "This function validates user input for the registration form, checking email format and password strength requirements. It returns a ValidationResult with any errors found."
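A hypothetical prompt builder for this translation step (the function and its wording are assumptions, not existing DiffScope code); the chunk's code plus light metadata is sent to the cheap model, and the returned summary is what gets embedded:

```rust
/// Build the prompt for the NL-translation step.
/// Illustrative sketch: prompt wording and signature are assumptions.
fn build_summary_prompt(symbol_name: &str, file_path: &str, raw_code: &str) -> String {
    format!(
        "Summarize in 1-2 sentences what the following function does, \
         for use in a code-search index. Describe behavior, inputs, and \
         return value in plain English.\n\n\
         Function `{symbol_name}` in `{file_path}`:\n\n{raw_code}"
    )
}
```

Keeping the prompt short and uniform matters here: the summaries are cached per chunk (see Phase 3), so the prompt only runs when a chunk's code actually changes.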
- Embedding + storage — Embed the NL summaries (not raw code) using text-embedding-3-small and store in PostgreSQL with pgvector:

```sql
CREATE TABLE code_chunks (
    id SERIAL PRIMARY KEY,
    repo_path TEXT,
    file_path TEXT,
    start_line INT,
    end_line INT,
    symbol_name TEXT,
    chunk_type TEXT, -- function, class, method, module
    raw_code TEXT,
    nl_summary TEXT,
    embedding vector(1536),
    updated_at TIMESTAMP
);
CREATE INDEX ON code_chunks USING ivfflat (embedding vector_cosine_ops);
```
Phase 2: Retrieval at Review Time
- For each changed function/hunk, generate a query describing the change
- Semantic search against pgvector for related code chunks
- Rank by relevance, filter by context budget
- Inject retrieved chunks into the review prompt as cross-file context
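The retrieval step above can be sketched as a query builder against the proposed code_chunks table. `<=>` is pgvector's cosine-distance operator; the exclusion of changed files (so only genuinely cross-file context is injected) and the function itself are assumptions for illustration:

```rust
/// Build the review-time semantic-search SQL against code_chunks.
/// $1 = query embedding, $2 = array of file paths changed in the PR
/// (excluded so retrieved context is cross-file). Sketch only.
fn retrieval_sql(limit: usize) -> String {
    format!(
        "SELECT file_path, start_line, end_line, symbol_name, raw_code, \
                1 - (embedding <=> $1) AS similarity \
         FROM code_chunks \
         WHERE file_path <> ALL($2) \
         ORDER BY embedding <=> $1 \
         LIMIT {limit}"
    )
}
```

Ranking falls out of the ORDER BY; the context-budget filter would then trim the result set by token count before prompt injection.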
Phase 3: Incremental Updates
- On each review, only re-index changed files
- Cache NL summaries — only regenerate if code changed
- Use git diff to identify stale chunks
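The cache-invalidation rule above amounts to: re-summarize and re-embed a chunk only when its code content changes. A minimal sketch, using the standard library's hasher purely for illustration (a real pipeline would likely use a content hash such as SHA-256):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Content hash of a chunk's raw code. DefaultHasher is illustrative only;
/// it is not stable across Rust releases, so persisted caches would need
/// a stable hash (e.g. SHA-256) instead.
fn chunk_hash(raw_code: &str) -> u64 {
    let mut h = DefaultHasher::new();
    raw_code.hash(&mut h);
    h.finish()
}

/// Re-index when there is no cached hash (new chunk) or the code changed.
fn needs_reindex(cached_hash: Option<u64>, raw_code: &str) -> bool {
    match cached_hash {
        Some(h) => h != chunk_hash(raw_code),
        None => true,
    }
}
```

git diff narrows the candidate set to changed files; the hash check then catches the chunks within those files whose bodies actually differ.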
Architecture Fit
DiffScope already has:
- PostgreSQL support (storage_pg.rs) ✅
- Function chunker (function_chunker.rs) ✅
- Code summary module (code_summary.rs) ✅
- Symbol index with multi-language support ✅
- Context budget management (context_helpers.rs) ✅
The main new components are: NL translation step, pgvector storage, and semantic retrieval at review time.
Priority
Critical — highest-leverage improvement for review quality. This is the single biggest architectural difference between DiffScope and the tools achieving an 82% catch rate.