
SourceMind


SourceMind turns 1,000-page document sets into a question-answerable knowledge base with page-level citations — then augments answers with live PubMed, clinical trial, and legal research.

Built on Recursive Language Models (RLMs), not basic RAG. The RLM writes and executes Python code to navigate documents, handling inputs two orders of magnitude beyond typical context windows.


Key Capabilities

  • Upload documents up to 1,000+ pages (PDF, TXT, MD, DOCX)
  • Ask natural language questions across one or many documents
  • Run medical-legal reviews — automated 5-step pipeline with per-facility map-reduce
  • Search PubMed (36M+ citations), ClinicalTrials.gov, and case law (Midpage) mid-answer
  • Get precise answers with [Source: filename, Page N] citations — click to view the original passage
  • Track API costs per query in real time
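The bracketed citation format lends itself to simple post-hoc checking. As an illustration only (not SourceMind's actual extractor), a regex can pull `[Source: filename, Page N]` markers out of an answer:

```python
import re

# Matches the inline citation format: [Source: filename, Page N]
CITATION_RE = re.compile(r"\[Source:\s*(?P<file>[^,\]]+),\s*Page\s*(?P<page>\d+)\]")

def extract_citations(answer: str) -> list[tuple[str, int]]:
    """Return (filename, page) pairs for every citation marker in an answer."""
    return [(m.group("file").strip(), int(m.group("page")))
            for m in CITATION_RE.finditer(answer)]

answer = ("CRP was 8.2 mg/dL on admission [Source: TGH_Record.pdf, Page 15] "
          "and albumin was 2.8 g/dL [Source: TGH_Record.pdf, Page 18].")
print(extract_citations(answer))
```

Each extracted pair can then be resolved back to the stored page text, which is what makes the click-to-view citation buttons possible.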

Use Cases

  • Med-legal expert witness: Full standard-of-care review across 1,000+ pages from multiple facilities
  • Clinical research: Synthesize findings across uploaded docs + PubMed literature
  • Clinical trial matching: Upload patient records, auto-extract diagnosis, find eligible trials
  • Legal research: Search case law and analyze judicial opinions alongside medical records
  • General knowledge work: Any professional who needs precision over large document sets

Screenshots

Notebook workspace with 4 documents loaded
Notebook workspace — document panel with page counts, token totals, and conversation history

Query response with inline citations
Query with inline citations — click any citation button to view the source passage

Citation viewer panel showing source passage
Citation viewer — original passage highlighted with page navigation


Architecture

flowchart TB
    subgraph Frontend["Frontend (React + TypeScript + Tailwind)"]
        UI[Notebook Workspace]
        DP[Document Panel]
        CP[Chat Panel]
        CV[Citation Viewer]
    end

    subgraph Backend["Backend (Python + FastAPI)"]
        ING[Document Ingestion<br/>PDF · TXT · MD · DOCX]
        CDI[Cross-Document Indexer<br/>classify · extract · timeline]
        SR[Smart Router]
        RLM[RLM Engine<br/>Root LM + Sub LM + REPL]
        RP[Review Pipeline<br/>5-step med-legal analysis]
        AH[Anti-Hallucination Stack<br/>refusal · consistency · temporal]
        CT[Citation Extractor + Normalizer]
    end

    subgraph External["External Research"]
        PM[PubMed<br/>36M+ citations]
        CTG[ClinicalTrials.gov]
        MP[Midpage Legal<br/>case law search]
    end

    UI --> ING
    UI --> SR
    SR -->|"< 150K tokens"| API[Direct Claude API]
    SR -->|">= 150K tokens"| RLM
    SR -->|"med-legal review"| RP
    RP --> RLM
    RLM --> PM & CTG & MP
    RLM --> CT
    RP --> AH
    CDI --> RP
    ING --> CDI
    CT --> CV

Smart Router

| Context Size | Strategy | Why |
| --- | --- | --- |
| < 150K tokens | Direct Claude API call | Faster, cheaper for shorter inputs |
| >= 150K tokens | RLM Engine (recursive navigation) | Agentic, code-driven navigation across massive contexts |
| Medical-legal review | Review Pipeline (5-step) | Structured multi-pass analysis with per-facility map-reduce |
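The routing rule reduces to a review flag plus a token-count threshold. A minimal sketch, assuming the real Smart Router weighs additional signals; the names here are illustrative:

```python
from enum import Enum

TOKEN_THRESHOLD = 150_000  # routing cutoff from the table above

class Route(Enum):
    DIRECT_API = "direct_claude_api"
    RLM_ENGINE = "rlm_engine"
    REVIEW_PIPELINE = "review_pipeline"

def route_query(context_tokens: int, is_medlegal_review: bool) -> Route:
    """Pick a strategy: the review pipeline wins, then size-based routing."""
    if is_medlegal_review:
        return Route.REVIEW_PIPELINE
    if context_tokens < TOKEN_THRESHOLD:
        return Route.DIRECT_API
    return Route.RLM_ENGINE

print(route_query(90_000, False))   # small context: direct API
print(route_query(400_000, False))  # large context: RLM engine
```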

RLM Engine

  1. Root LM (Claude Sonnet 4) examines document structure via generated Python code
  2. Sub LM (Claude Haiku 4.5) analyzes individual passages for semantic understanding
  3. External tools available in the REPL: PubMed, ClinicalTrials.gov, Midpage legal research
  4. Code execution filters, searches, and navigates — the model decides HOW to traverse
  5. All claims require citations, verified against source text
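The root/sub split in the steps above can be sketched with stubbed model calls. This is illustrative only: the real engine runs model-generated Python in a REPL, whereas here the "root" step is a hard-coded keyword filter and the "sub" step a stub summary:

```python
# Illustrative sketch of root-LM navigation delegating passages to a sub LM.
MAX_RECURSION_DEPTH = 3  # mirrors the RLM_MAX_RECURSION_DEPTH setting

def sub_lm_analyze(passage: str, question: str) -> str:
    """Stand-in for the Haiku sub-LM call on a single passage."""
    return f"summary of {len(passage)} chars re: {question!r}"

def root_navigate(pages: list[str], question: str, depth: int = 0) -> list[str]:
    """Stand-in for root-LM navigation: filter pages, analyze each hit."""
    if depth >= MAX_RECURSION_DEPTH:
        return []
    keyword = question.split()[0].lower()
    findings = []
    for page in pages:
        if keyword in page.lower():
            findings.append(sub_lm_analyze(page, question))
    return findings

pages = ["Fever of 5 days, CRP 8.2 mg/dL", "Discharge summary", "fever resolved"]
print(root_navigate(pages, "fever criteria"))
```

The key property the sketch preserves is that navigation is code, not embedding similarity: the root step decides programmatically which passages are worth a sub-LM call, under a hard depth limit.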

Sample Queries

Input: "Did the patient meet the diagnostic criteria for incomplete Kawasaki disease at any facility visit?"

Output (excerpt): The patient presented with fever of 5 days duration and CRP of 8.2 mg/dL (Page 15, TGH_Record.pdf), meeting the entry criteria of CRP ≥3.0 mg/dL [FACT]. Of the 6 supplementary laboratory criteria, 3 were met: Albumin 2.8 g/dL ≤3.0 (Page 18), WBC 16,200/uL ≥15,000 (Page 15), and Urine WBC 12/HPF ≥10 (Page 22) [FACT]. The AHA guideline threshold of ≥3 of 6 criteria was reached [FACT]. CONFIDENCE: 8/10 (Citation: 9, Guideline: 9, Allegations: 7, Corroboration: 7)

Input: "Search PubMed for the latest meta-analyses on IV thrombolysis window extension beyond 4.5 hours"

Output (excerpt): A 2024 individual patient data meta-analysis (PMID: 38901847) pooling 6 RCTs (n=3,412) found that alteplase administered 4.5–9 hours after onset in patients selected by perfusion imaging had significantly better functional outcomes (mRS 0-1: OR 1.49, 95% CI 1.10–2.01) [PMID: 38901847].


Quick Start

Prerequisites: Python 3.12, Node.js with npm, and an Anthropic API key (Docker optional, for production deployment).

1. Clone

git clone https://github.com/rdmgator12/SourceMind.git
cd SourceMind

2. Backend

cd backend
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY

pip install -r requirements.lock  # or requirements.txt for latest compatible
uvicorn app.main:app --reload

Backend runs at http://localhost:8000

3. Frontend

cd frontend
npm install
npm run dev

Frontend runs at http://localhost:3000 (proxies API calls to backend)

Docker (Production)

cp backend/.env.example backend/.env
# Edit backend/.env with your API key

docker-compose up --build

App available at http://localhost:3000


Configuration

| Variable | Default | Description |
| --- | --- | --- |
| ANTHROPIC_API_KEY | | Your Anthropic API key (required) |
| ROOT_MODEL | claude-sonnet-4-20250514 | Root LM for orchestration |
| SUB_MODEL | claude-haiku-4-5-20251001 | Sub LM for recursive passage analysis |
| RLM_ENVIRONMENT | local | REPL sandbox (local or docker) |
| RLM_MAX_RECURSION_DEPTH | 3 | Max recursion depth per query |
| RLM_TIMEOUT_SECONDS | 900 | Max execution time per query |
| RLM_MAX_BUDGET_USD | 10.00 | Max API spend per query |
| RLM_PER_STEP_BUDGET_USD | 3.00 | Max API spend per pipeline step |
| MAX_FILE_SIZE_MB | 150 | Upload size limit |
| ALLOWED_EXTENSIONS | pdf,txt,md,docx | Accepted file types |
| NCBI_API_KEY | | Optional NCBI key (raises PubMed rate limit to 10 req/sec) |
| NCBI_EMAIL | | Optional email for NCBI API usage |

API

POST   /api/notebooks                         Create notebook
GET    /api/notebooks                         List notebooks
GET    /api/notebooks/:id                     Get notebook
DELETE /api/notebooks/:id                     Delete notebook

POST   /api/notebooks/:id/documents           Upload document
GET    /api/notebooks/:id/documents           List documents
DELETE /api/notebooks/:id/documents/:did      Remove document
GET    /api/documents/:did/page/:page         Get page text

POST   /api/notebooks/:id/query              Submit query (multi-source)
POST   /api/notebooks/:id/review             Run medical-legal review
GET    /api/notebooks/:id/conversations       List conversations
GET    /api/conversations/:cid               Get conversation

GET    /api/stats                             Usage stats

WS     /ws/query/:notebook_id                 Streaming query via WebSocket
WS     /ws/review/:notebook_id               Streaming review via WebSocket

Interactive API docs (Swagger UI) available at http://localhost:8000/docs when running locally.
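The query endpoint can be exercised from any HTTP client. A stdlib-only sketch that prepares (but, for illustration, does not send) a request; the JSON body field is a guess, since the request schema isn't shown here — check the Swagger docs for the real one:

```python
import json
from urllib.request import Request

BASE = "http://localhost:8000"

def build_query_request(notebook_id: str, question: str) -> Request:
    """Prepare a POST to /api/notebooks/:id/query (body field is hypothetical)."""
    body = json.dumps({"question": question}).encode()
    return Request(
        f"{BASE}/api/notebooks/{notebook_id}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_query_request("nb-123", "Summarize the admission notes")
print(req.full_url, req.method)
```

Sending it is a `urllib.request.urlopen(req)` call against a running backend; for streaming results, use the WebSocket endpoints instead.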


Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | React 18, TypeScript, Tailwind CSS, Zustand |
| Backend | Python 3.12, FastAPI, SQLAlchemy, aiosqlite |
| RLM | rlms library (MIT), Anthropic backend |
| Document Parsing | PyMuPDF (PDF), python-docx (DOCX) |
| Literature Search | PubMed E-utilities (NCBI) |
| Trial Matching | ClinicalTrials.gov v2 API |
| Legal Research | Midpage Legal Research (MCP) |
| LLMs | Claude Sonnet 4 (root) + Haiku 4.5 (sub) |
| Deploy | Docker Compose, Nginx |

Testing

cd backend
python -m pytest tests/ -v

296 tests covering: review pipeline, cross-document indexer, document selector, citation normalization, smart router, ingestion pipeline, consistency checker, refusal detection, temporal guard, facility normalization, cost tracking, and adversarial edge cases.

CI runs automatically on every push via GitHub Actions.


Contributing

See CONTRIBUTING.md for setup instructions, test requirements, and PR guidelines.


License

Business Source License 1.1 — free for non-competitive use; converts to Apache 2.0 on 2030-03-22. See LICENSE for full terms.


Built by Ralph Martello & Elle.
