
Embedding-based RAG pipeline with function-level chunking #22

@haasonsaas

Description


Problem

Greptile attributes its reported 82% bug catch rate largely to its embedding-based retrieval pipeline. DiffScope has a symbol index and symbol graph but no embedding-based semantic search, so cross-file context is limited to what the symbol graph can structurally resolve, missing purely semantic relationships.

How Competitors Do It

Greptile (best-in-class)

Two-phase indexing:

  1. Chunker — breaks code into function-level chunks (not file-level). Their experiments show:
    • File-level chunking: 0.718 cosine similarity
    • Function-level chunking: 0.768 cosine similarity
    • Adding surrounding "noise" code degrades retrieval dramatically
  2. Summarizer — translates each code chunk to a natural language description before embedding:
    • Raw code-to-query similarity: 0.728
    • NL-translated code-to-query similarity: 0.815
  3. Storage: pgvector in PostgreSQL (not a dedicated vector DB)
  4. Embedding model: text-embedding-3-small (OpenAI, 1536 dimensions)
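All of the similarity numbers above are cosine similarity between query and chunk embeddings. As a quick reference, a minimal Rust implementation of the metric (the vectors here are toy values, not real 1536-dimension embeddings):

```rust
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0.0 for zero-length vectors to avoid division by zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

fn main() {
    let query = [1.0, 0.0, 1.0];
    let chunk = [1.0, 1.0, 1.0];
    println!("{:.3}", cosine_similarity(&query, &chunk)); // prints 0.816
}
```

Note that pgvector's `vector_cosine_ops` indexes cosine *distance* (1 − similarity), so smaller is better on the SQL side.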

CodeRabbit

  • LanceDB (serverless vector DB on S3) for semantic code search
  • Combined with ast-grep for structural patterns

Qodo Merge (Pro)

  • Proprietary Qodo Embed-1 embedding model (top CoIR benchmark performer)
  • Language-specific static analysis for chunking (not naive line splitting)
  • Re-ranking step after initial semantic search

Proposed Solution

Phase 1: Indexing Pipeline

  1. Function-level chunker — DiffScope already has function_chunker.rs. Extend it to:

    • Extract function/method/class bodies as individual chunks
    • Preserve metadata (file path, line range, function name, enclosing class)
    • Support all languages the symbol index already covers
  2. NL translation — For each chunk, use a cheap/fast model (GPT-4o-mini or Haiku) to generate a natural language summary:

    "This function validates user input for the registration form, checking email format and password strength requirements. It returns a ValidationResult with any errors found."
    
  3. Embedding + storage — Embed the NL summaries (not raw code) using text-embedding-3-small and store in PostgreSQL with pgvector:

    CREATE TABLE code_chunks (
      id SERIAL PRIMARY KEY,
      repo_path TEXT,
      file_path TEXT,
      start_line INT,
      end_line INT,
      symbol_name TEXT,
      chunk_type TEXT, -- function, class, method, module
      raw_code TEXT,
      nl_summary TEXT,
      embedding vector(1536),
      updated_at TIMESTAMP
    );
    CREATE INDEX ON code_chunks USING ivfflat (embedding vector_cosine_ops);
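The three indexing steps above can be wired together roughly as follows. This is a sketch, not DiffScope's actual API: `CodeChunk` and `index_chunk` are hypothetical names, and the summarizer and embedder are injected as closures so the real calls (GPT-4o-mini / text-embedding-3-small) can be stubbed out.

```rust
// Chunk metadata mirrors the code_chunks table columns above.
#[derive(Debug, Clone)]
struct CodeChunk {
    file_path: String,
    start_line: u32,
    end_line: u32,
    symbol_name: String,
    chunk_type: String, // "function" | "class" | "method" | "module"
    raw_code: String,
}

#[derive(Debug)]
struct IndexedChunk {
    chunk: CodeChunk,
    nl_summary: String,
    embedding: Vec<f32>,
}

// The key point from the Greptile numbers: embed the NL summary,
// not the raw code (0.815 vs 0.728 query similarity).
fn index_chunk(
    chunk: CodeChunk,
    summarize: impl Fn(&str) -> String,
    embed: impl Fn(&str) -> Vec<f32>,
) -> IndexedChunk {
    let nl_summary = summarize(&chunk.raw_code);
    let embedding = embed(&nl_summary);
    IndexedChunk { chunk, nl_summary, embedding }
}

fn main() {
    let chunk = CodeChunk {
        file_path: "src/validate.rs".into(),
        start_line: 1,
        end_line: 8,
        symbol_name: "validate_input".into(),
        chunk_type: "function".into(),
        raw_code: "fn validate_input(email: &str) -> bool { email.contains('@') }".into(),
    };
    // Stub closures stand in for the LLM summarizer and embedding API.
    let indexed = index_chunk(
        chunk,
        |_code| "Validates an email address by checking for '@'.".into(),
        |summary| vec![summary.len() as f32; 4], // fake 4-dim embedding
    );
    println!("{} -> {}", indexed.chunk.symbol_name, indexed.nl_summary);
}
```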

Phase 2: Retrieval at Review Time

  1. For each changed function/hunk, generate a query describing the change
  2. Semantic search against pgvector for related code chunks
  3. Rank by relevance, filter by context budget
  4. Inject retrieved chunks into the review prompt as cross-file context
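Once pgvector returns scored candidates, steps 3–4 reduce to a greedy selection under the context budget. A minimal sketch (the function name and tuple shape are illustrative; in practice the cost would be a token count and scores would come from the pgvector query):

```rust
// Pick the highest-scoring chunks that fit in the context budget.
// Input: (relevance score, token cost, chunk text) triples.
fn select_within_budget(
    mut scored: Vec<(f32, usize, String)>,
    budget: usize,
) -> Vec<String> {
    // Sort by relevance, descending. Assumes scores are never NaN.
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    let mut used = 0;
    let mut out = Vec::new();
    for (_score, cost, chunk) in scored {
        if used + cost <= budget {
            used += cost;
            out.push(chunk);
        }
    }
    out
}

fn main() {
    let candidates = vec![
        (0.9, 300, "fn validate_input(...)".to_string()),
        (0.8, 500, "struct ValidationResult".to_string()),
        (0.7, 200, "fn check_password(...)".to_string()),
    ];
    // With a 600-token budget, the 500-token chunk no longer fits
    // after the top hit, so the cheaper third chunk is taken instead.
    println!("{:?}", select_within_budget(candidates, 600));
}
```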

Phase 3: Incremental Updates

  • On each review, only re-index changed files
  • Cache NL summaries — only regenerate if code changed
  • Use git diff to identify stale chunks
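The summary cache can key on a content hash of each chunk, so an NL summary is regenerated only when the code actually changed. A sketch using std's `DefaultHasher` (a real implementation would likely prefer a stable digest such as SHA-256, since `DefaultHasher` output is not guaranteed stable across Rust releases; the cache shape here is hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hash the raw chunk text; any edit changes the hash.
fn content_hash(code: &str) -> u64 {
    let mut h = DefaultHasher::new();
    code.hash(&mut h);
    h.finish()
}

// A cached summary is fresh only if the stored hash still matches
// the current code; otherwise the chunk is stale and re-summarized.
fn summary_is_fresh(cache: &HashMap<String, u64>, key: &str, code: &str) -> bool {
    cache.get(key) == Some(&content_hash(code))
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert("src/auth.rs::validate".to_string(), content_hash("fn v() {}"));
    println!("{}", summary_is_fresh(&cache, "src/auth.rs::validate", "fn v() {}"));
}
```

Combined with `git diff --name-only`, only chunks in changed files even need the hash check.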

Architecture Fit

DiffScope already has:

  • PostgreSQL support (storage_pg.rs) ✅
  • Function chunker (function_chunker.rs) ✅
  • Code summary module (code_summary.rs) ✅
  • Symbol index with multi-language support ✅
  • Context budget management (context_helpers.rs) ✅

The main new components are: NL translation step, pgvector storage, and semantic retrieval at review time.

Priority

Critical: the highest-leverage improvement available for review quality, and the single biggest architectural difference between DiffScope and Greptile's reported 82% catch rate.

Labels: enhancement (New feature or request)