
Embedding-based RAG pipeline with function-level chunking #22

@haasonsaas

Description


Problem

Greptile attributes its reported 82% bug catch rate largely to its embedding-based retrieval pipeline. DiffScope has a symbol index and symbol graph but no embedding-based semantic search, so cross-file context is limited to what the symbol graph can structurally resolve, missing purely semantic relationships.

How Competitors Do It

Greptile (best-in-class)

Two-phase indexing:

  1. Chunker — breaks code into function-level chunks (not file-level). Their experiments show:
    • File-level chunking: 0.718 cosine similarity
    • Function-level chunking: 0.768 cosine similarity
    • Adding surrounding "noise" code degrades retrieval dramatically
  2. Summarizer — translates each code chunk to a natural language description before embedding:
    • Raw code-to-query similarity: 0.728
    • NL-translated code-to-query similarity: 0.815
  3. Storage: pgvector in PostgreSQL (not a dedicated vector DB)
  4. Embedding model: text-embedding-3-small (OpenAI, 1536 dimensions)
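All of the similarity numbers above are cosine similarity between query and chunk embeddings. As a quick reference, a minimal Rust implementation of the metric (the vectors here are toy values, not real 1536-dimension embeddings):

```rust
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0.0 for zero-length vectors to avoid division by zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

fn main() {
    let query = [1.0, 0.0, 1.0];
    let chunk = [1.0, 1.0, 1.0];
    println!("{:.3}", cosine_similarity(&query, &chunk)); // prints 0.816
}
```

Note that pgvector's `vector_cosine_ops` indexes cosine *distance* (1 − similarity), so smaller is better on the SQL side.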

CodeRabbit

  • LanceDB (serverless vector DB on S3) for semantic code search
  • Combined with ast-grep for structural patterns

Qodo Merge (Pro)

  • Proprietary Qodo Embed-1 embedding model (top CoIR benchmark performer)
  • Language-specific static analysis for chunking (not naive line splitting)
  • Re-ranking step after initial semantic search

Proposed Solution

Phase 1: Indexing Pipeline

  1. Function-level chunker — DiffScope already has function_chunker.rs. Extend it to:

    • Extract function/method/class bodies as individual chunks
    • Preserve metadata (file path, line range, function name, enclosing class)
    • Support all languages the symbol index already covers
  2. NL translation — For each chunk, use a cheap/fast model (GPT-4o-mini or Haiku) to generate a natural language summary:

    "This function validates user input for the registration form, checking email format and password strength requirements. It returns a ValidationResult with any errors found."
    
  3. Embedding + storage — Embed the NL summaries (not raw code) using text-embedding-3-small and store in PostgreSQL with pgvector:

    CREATE TABLE code_chunks (
      id SERIAL PRIMARY KEY,
      repo_path TEXT,
      file_path TEXT,
      start_line INT,
      end_line INT,
      symbol_name TEXT,
      chunk_type TEXT, -- function, class, method, module
      raw_code TEXT,
      nl_summary TEXT,
      embedding vector(1536),
      updated_at TIMESTAMP
    );
    CREATE INDEX ON code_chunks USING ivfflat (embedding vector_cosine_ops);
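The three indexing steps above can be wired together roughly as follows. This is a sketch, not DiffScope's actual API: `CodeChunk` and `index_chunk` are hypothetical names, and the summarizer and embedder are injected as closures so the real calls (GPT-4o-mini / text-embedding-3-small) can be stubbed out.

```rust
// Chunk metadata mirrors the code_chunks table columns above.
#[derive(Debug, Clone)]
struct CodeChunk {
    file_path: String,
    start_line: u32,
    end_line: u32,
    symbol_name: String,
    chunk_type: String, // "function" | "class" | "method" | "module"
    raw_code: String,
}

#[derive(Debug)]
struct IndexedChunk {
    chunk: CodeChunk,
    nl_summary: String,
    embedding: Vec<f32>,
}

// The key point from the Greptile numbers: embed the NL summary,
// not the raw code (0.815 vs 0.728 query similarity).
fn index_chunk(
    chunk: CodeChunk,
    summarize: impl Fn(&str) -> String,
    embed: impl Fn(&str) -> Vec<f32>,
) -> IndexedChunk {
    let nl_summary = summarize(&chunk.raw_code);
    let embedding = embed(&nl_summary);
    IndexedChunk { chunk, nl_summary, embedding }
}

fn main() {
    let chunk = CodeChunk {
        file_path: "src/validate.rs".into(),
        start_line: 1,
        end_line: 8,
        symbol_name: "validate_input".into(),
        chunk_type: "function".into(),
        raw_code: "fn validate_input(email: &str) -> bool { email.contains('@') }".into(),
    };
    // Stub closures stand in for the LLM summarizer and embedding API.
    let indexed = index_chunk(
        chunk,
        |_code| "Validates an email address by checking for '@'.".into(),
        |summary| vec![summary.len() as f32; 4], // fake 4-dim embedding
    );
    println!("{} -> {}", indexed.chunk.symbol_name, indexed.nl_summary);
}
```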

Phase 2: Retrieval at Review Time

  1. For each changed function/hunk, generate a query describing the change
  2. Semantic search against pgvector for related code chunks
  3. Rank by relevance, filter by context budget
  4. Inject retrieved chunks into the review prompt as cross-file context
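Once pgvector returns scored candidates, steps 3–4 reduce to a greedy selection under the context budget. A minimal sketch (the function name and tuple shape are illustrative; in practice the cost would be a token count and scores would come from the pgvector query):

```rust
// Pick the highest-scoring chunks that fit in the context budget.
// Input: (relevance score, token cost, chunk text) triples.
fn select_within_budget(
    mut scored: Vec<(f32, usize, String)>,
    budget: usize,
) -> Vec<String> {
    // Sort by relevance, descending. Assumes scores are never NaN.
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    let mut used = 0;
    let mut out = Vec::new();
    for (_score, cost, chunk) in scored {
        if used + cost <= budget {
            used += cost;
            out.push(chunk);
        }
    }
    out
}

fn main() {
    let candidates = vec![
        (0.9, 300, "fn validate_input(...)".to_string()),
        (0.8, 500, "struct ValidationResult".to_string()),
        (0.7, 200, "fn check_password(...)".to_string()),
    ];
    // With a 600-token budget, the 500-token chunk no longer fits
    // after the top hit, so the cheaper third chunk is taken instead.
    println!("{:?}", select_within_budget(candidates, 600));
}
```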

Phase 3: Incremental Updates

  • On each review, only re-index changed files
  • Cache NL summaries — only regenerate if code changed
  • Use git diff to identify stale chunks
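The summary cache can key on a content hash of each chunk, so an NL summary is regenerated only when the code actually changed. A sketch using std's `DefaultHasher` (a real implementation would likely prefer a stable digest such as SHA-256, since `DefaultHasher` output is not guaranteed stable across Rust releases; the cache shape here is hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hash the raw chunk text; any edit changes the hash.
fn content_hash(code: &str) -> u64 {
    let mut h = DefaultHasher::new();
    code.hash(&mut h);
    h.finish()
}

// A cached summary is fresh only if the stored hash still matches
// the current code; otherwise the chunk is stale and re-summarized.
fn summary_is_fresh(cache: &HashMap<String, u64>, key: &str, code: &str) -> bool {
    cache.get(key) == Some(&content_hash(code))
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert("src/auth.rs::validate".to_string(), content_hash("fn v() {}"));
    println!("{}", summary_is_fresh(&cache, "src/auth.rs::validate", "fn v() {}"));
}
```

Combined with `git diff --name-only`, only chunks in changed files even need the hash check.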

Architecture Fit

DiffScope already has:

  • PostgreSQL support (storage_pg.rs) ✅
  • Function chunker (function_chunker.rs) ✅
  • Code summary module (code_summary.rs) ✅
  • Symbol index with multi-language support ✅
  • Context budget management (context_helpers.rs) ✅

The main new components are: NL translation step, pgvector storage, and semantic retrieval at review time.

Priority

Critical: the highest-leverage improvement available for review quality, and the single biggest architectural difference between DiffScope and Greptile's reported 82% catch rate.

Labels: enhancement (New feature or request)