A next-generation PDF Q&A platform combining RAPTOR hierarchical indexing, Hybrid BM25 + Dense retrieval, and Llama 3.3 70B via Groq to turn any textbook into an interactive AI tutor — with source citations, code examples, and complexity analysis in every answer.
- 🚀 What is ResolvaBot?
- ✨ Features
- 🏗️ Architecture
- 📁 Project Structure
- ⚙️ System Requirements
- 🛠️ Installation
- 🔑 API Keys Setup
- ▶️ Running the App
- 🖥️ Using the Interface
- 🔬 Technical Deep Dive
- 🎨 UI Themes
- 🧪 Testing the Pipeline
- 🔧 Troubleshooting
- 📦 Full Dependencies List
- 🤝 Contributing
ResolvaBot LLM is a desktop-first, locally-hosted AI study assistant that transforms any PDF textbook into a fully searchable, intelligent Q&A knowledge base. It combines cutting-edge retrieval techniques with large language models to deliver structured, cited, code-inclusive answers — just like having a senior tutor available 24/7.
Who is it built for?
| Audience | Use Case |
|---|---|
| 🎓 Students | Ask questions directly from CS / algorithms / science textbooks |
| 👨🏫 Educators | Surface relevant passages for lecture preparation instantly |
| 🔬 Researchers | Query technical PDFs, papers, and manuals with precision |
| 💻 Developers | Understand codebases by uploading technical documentation |
| Feature | Details |
|---|---|
| RAPTOR Hierarchical Index | Recursive GMM clustering + LLM summarization — builds a tree-structured index for multi-granularity retrieval (leaf → branch → root) |
| Hybrid BM25 + Dense Search | Keyword matching (Whoosh BM25) + SBERT semantic vectors (FAISS cosine) working together |
| Reciprocal Rank Fusion (RRF) | Merges and re-ranks results from both retrieval branches for maximum combined relevance |
| WordNet Query Expansion | Automatically expands queries with synonyms to improve recall on paraphrased content |
| Multi-Model LLM Auto-Fallback | Primary model plus five fallback tiers: Groq Llama 3.3 70B → Llama 3.1 70B → Llama 3 8B → OpenAI GPT-3.5 → Wikipedia → Raw context |
| Wikipedia Live Fallback | When PDF context is insufficient, fetches real-time Wikipedia articles to answer |
| Source Passage Attribution | Every answer shows which passages were retrieved and their exact RRF relevance scores |
| Markdown + Code Answers | All LLM responses render full markdown — headers, code blocks with syntax highlighting, tables, bold |
| Feature | Details |
|---|---|
| PyMuPDF Extraction | Fast, accurate text extraction from any digital PDF |
| Tesseract OCR Fallback | Automatically uses OCR for scanned / image-based PDFs when detected |
| NLTK Sentence Chunking | Smart chunking that preserves sentence boundaries (~100 tokens per chunk) |
| SBERT Embeddings | all-MiniLM-L6-v2 — 384-dimensional semantic embeddings, CPU-fast |
| FAISS Vector Store | In-memory approximate nearest-neighbor search — no Docker, no server required |
| Real-Time 5-Step Progress | Live pipeline dashboard: Extract → Chunk → RAPTOR → FAISS → BM25 |
| Feature | Details |
|---|---|
| 3-Page Navigation | Upload & Index → Chat → Sources & PDF Preview |
| Collapsible Sidebar | Hamburger ☰ menu with navigation, theme switcher, document info, action buttons |
| 4 Color Themes | Dark, Light, Indigo, Teal — switches instantly with no reload |
| ChatGPT-Style Chat Interface | Bot on left with markdown rendering, user on right with gradient bubbles |
| Full-Height PDF Viewer | Native browser PDF rendering with zoom, scroll, search toolbar |
| No Full-Page Scroll | Fixed viewport layout — chat and sources scroll independently |
| Desktop-First Design | Centered 860px max-width container — optimized for widescreen monitors |
| Professional Topbar | Fixed navigation bar with LLM status badge and indexed status indicator |
| Document Summary Card | Pages, chunks, RAPTOR nodes, and file size shown after indexing |
┌─────────────────────────────────────────────────────────────────┐
│ USER UPLOADS PDF │
└──────────────────────────┬──────────────────────────────────────┘
│
┌──────────────▼──────────────┐
│ 1. TEXT EXTRACTION │
│ PyMuPDF → OCR fallback │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ 2. SENTENCE CHUNKING │
│ NLTK → ~100 tokens/chunk │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ 3. SBERT EMBEDDINGS │
│ all-MiniLM-L6-v2 384-dim │
└──────────────┬──────────────┘
│
┌─────────────────▼─────────────────┐
│ 4. RAPTOR INDEXING │
│ Level 0: Original chunks │
│ ↓ GMM soft clustering │
│ Level 1: LLM cluster summaries │
│ ↓ Re-embed + cluster again │
│ Level 2: High-level abstractions │
│ All nodes stored for retrieval │
└──────────┬──────────────┬──────────┘
│ │
┌──────────▼───┐ ┌───────▼──────┐
│ FAISS Index │ │ BM25 Index │
│ Dense vecs │ │ Whoosh KW │
└──────────────┘ └──────────────┘
USER ASKS A QUESTION
│
┌──────────────▼──────────────┐
│ WordNet Query Expansion │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────────────┐
│ HYBRID RETRIEVAL │
│ BM25 Top-20 + FAISS Top-20 │
│ RRF Re-Ranking │
│ → Final Top-8 │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────────────┐
│ LLM ANSWER GENERATION │
│ 1st → Groq Llama 3.3 70B (free, fast) │
│ 2nd → Groq Llama 3.1 70B │
│ 3rd → Groq Llama 3 8B │
│ 4th → OpenAI GPT-3.5 Turbo │
│ 5th → Wikipedia API (live articles) │
│ 6th → Raw context excerpt (last resort) │
└──────────────────────────────────────────────┘
ResolvaBot-LLM/
│
├── 📄 app.py # Main Streamlit app — UI, routing, 3 page views
├── 📋 requirements.txt # All Python dependencies with pinned versions
├── 🔧 setup.sh # Automated one-command setup (Linux / macOS)
├── 🔧 setup.bat # Automated one-command setup (Windows)
├── 🧪 test_pipeline.py # Full pipeline validation — no API key needed
├── 🔐 .env.example # API key template — copy to .env
├── 📖 README.md # This documentation file
│
└── 📂 src/ # Core backend modules
├── 🔍 extraction.py # PyMuPDF PDF text extraction + Tesseract OCR fallback
├── ✂️ chunking.py # NLTK sentence-aware text chunking (~100 token chunks)
├── 🧠 embeddings.py # SBERT all-MiniLM-L6-v2 embedding generation
├── 🌲 raptor_index.py # RAPTOR: GMM clustering + recursive LLM summarization
├── ⚡ vector_store.py # FAISS in-memory vector database (no Docker required)
├── 🔎 retrieval.py # Hybrid BM25 + Dense + RRF re-ranking + WordNet expansion
└── 💬 question_answering.py # Multi-model LLM with 5-level auto-fallback chain
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 4 GB | 8 GB+ |
| Disk | 3 GB (models + deps) | 5 GB+ |
| CPU | Any modern dual-core | Quad-core+ |
| GPU | Not required | Optional (speeds embeddings) |
| Requirement | Minimum | Recommended |
|---|---|---|
| OS | Windows 10, macOS 11, Ubuntu 20.04 | Any modern 64-bit OS |
| Python | 3.9 | 3.11+ |
| Browser | Chrome 90+, Firefox 88+ | Chrome (best PDF viewer support) |
| Internet | Required for 1st run (SBERT download) | Broadband |
Linux / macOS:
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
bash setup.sh

Windows:
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
setup.bat

The script automatically creates a virtualenv, installs all dependencies, downloads NLTK data, and creates your `.env` file.
# 1. Clone repository
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
# 2. Create virtual environment
python -m venv venv
# 3. Activate
source venv/bin/activate # Linux / macOS
# venv\Scripts\activate.bat # Windows
# 4. Install all dependencies
pip install -r requirements.txt
# 5. Download NLTK required data
python -c "
import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
"
# 6. Set up environment file
cp .env.example .env
# Then edit .env and add your API keys

# Ubuntu / Debian
sudo apt-get install tesseract-ocr
# macOS (Homebrew)
brew install tesseract
# Windows
# Download installer: https://github.com/UB-Mannheim/tesseract/wiki

Tesseract is optional — only needed if your PDFs are scanned images rather than digital text.
Groq provides free access to Llama 3.3 70B — blazing fast, highest quality, no billing required.
- Sign up at console.groq.com — free, no credit card
- Navigate to API Keys → Create API Key
- Add to your `.env` file:

GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

- Sign up at platform.openai.com/api-keys
- Add to `.env`:

OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The app runs fully without any API key. Answers come from Wikipedia when the PDF context is insufficient. Ideal for testing the pipeline or offline use.
Your Question
│
▼
Groq Llama 3.3 70B ──(fail)──► Groq Llama 3.1 70B
│
(fail)───▼
Groq Llama 3 8B
│
(fail)───▼
OpenAI GPT-3.5 Turbo
│
(fail)───▼
Wikipedia Live Article
│
(fail)───▼
Raw context excerpt
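The chain above can be sketched in a few lines of Python. This is an illustrative simplification, not the project's actual code: `answer_with_fallback` and the stub providers are hypothetical names standing in for the real Groq / OpenAI / Wikipedia calls.

```python
# Try each answer source in order; return the first non-empty result.
def answer_with_fallback(question, context, providers):
    """providers: ordered list of (tier_name, callable) pairs."""
    for name, ask in providers:
        try:
            answer = ask(question, context)
            if answer:                 # treat a non-empty string as success
                return name, answer
        except Exception:              # bad key, rate limit, network error...
            continue                   # fall through to the next tier
    return "raw-context", context[:1500]   # last resort: raw excerpt

# Demo with stubs standing in for real API calls:
def groq_stub(question, context):
    raise RuntimeError("rate limited")     # simulate a failed API call

def wiki_stub(question, context):
    return "Binary search halves the search range on every comparison."

tier, answer = answer_with_fallback(
    "What is binary search?", "…context…",
    [("groq-llama-3.3-70b", groq_stub), ("wikipedia", wiki_stub)],
)
print(tier)   # → wikipedia
```

Because every tier is just a callable, adding a new provider (or reordering the chain) is a one-line change to the `providers` list.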
# Activate virtual environment first
source venv/bin/activate # Linux / macOS
# venv\Scripts\activate.bat # Windows
# Optional: Verify the pipeline works without API key
python test_pipeline.py
# ✅ All 6 tests passed! Pipeline is working correctly.
# Launch the app
streamlit run app.py

Open http://localhost:8501 in your browser.
💡 First run note: SBERT downloads all-MiniLM-L6-v2 (~90 MB) on first startup. This takes 1-2 minutes, once. All subsequent runs start in seconds.
① Drag & drop any PDF (up to 200 MB) or click to browse
② Watch the real-time 5-step pipeline execute:
📝 Extract Text → PyMuPDF reads all pages
✂️ Chunk Text → NLTK splits into ~100-token chunks
🌲 RAPTOR Index → GMM clusters + LLM summarizes (2 levels)
⚡ FAISS Store → Dense embeddings indexed
📑 BM25 Index → Keyword inverted index built
③ Document Summary card appears showing:
• Total pages extracted
• Number of text chunks
• RAPTOR tree nodes (chunks + summaries)
• File size
• Embedding model used
④ Click "💬 Start Chatting →" to open Chat page
① Type any question about the textbook content
② ResolvaBot retrieves the 8 most relevant passages using hybrid search
③ Llama 3.3 70B synthesizes a structured answer including:
• Concept/intuition explained first
• Step-by-step details
• Working code examples with syntax highlighting
• Time & space complexity analysis
• Source attribution (which model answered, from textbook or Wikipedia)
④ Click "📋 N source passages" expander to inspect retrieved passages
• Each passage shows: page number, relevance score, full text
• Code passages render with syntax highlighting
① Full-height PDF viewer with native browser controls:
• Page navigation
• Zoom in/out
• Text search within PDF
• Download option
② The PDF viewer renders the currently loaded document
③ Switch back to Chat to ask more questions
NAVIGATION
📂 Upload & Index → Go to upload page
💬 Chat → Go to chat page
🔍 Sources & PDF → View PDF
COLOR THEME
🔵 ⬛ 🟣 🟢 → 4 theme swatches (Dark, Light, Indigo, Teal)
DOCUMENT (when PDF is loaded)
📄 filename.pdf
296 pages · 1260 chunks · 1.0 MB
📂 Unload PDF → Clear and start fresh
🗑 Clear Chat → Reset conversation history
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) solves a fundamental limitation of standard RAG:
Standard RAG: Only retrieves from leaf-level chunks. Fails at "big picture" or cross-chapter questions.
RAPTOR: Builds a tree. Retrieves from all levels. Answers both specific facts AND thematic questions.
Original Chunks (Level 0)
"Binary search runs in O(log n)..."
"A sorted array is required for binary search..."
"The mid-point is calculated as (lo + hi) / 2..."
│
▼ GMM Soft Clustering
Cluster A: 3 chunks about binary search
Cluster B: 4 chunks about sorting algorithms
│
▼ LLM Summarize each cluster
Summary A: "Binary search is a divide-and-conquer algorithm..."
Summary B: "Sorting algorithms order elements using comparisons..."
│
▼ Re-embed, cluster again (Level 2)
Root summary: "Chapter 3 covers search and sorting algorithms..."
The final FAISS index contains all nodes — original chunks + every level of summaries. This means a question like "What is this chapter about?" can be answered from root nodes, while "What is the time complexity of binary search?" is answered from leaf nodes.
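The build loop above can be sketched as follows. This is an illustrative simplification: the real pipeline uses GMM soft clustering and LLM-generated summaries, while here both are pluggable stubs (`cluster_fn`, `summarize_fn` are hypothetical names) so only the recursive tree-building structure is shown.

```python
# Build a RAPTOR-style tree: cluster, summarize, then recurse on the
# summaries. Every node (leaves + all summary levels) is kept for retrieval.
def build_raptor_tree(chunks, cluster_fn, summarize_fn, max_levels=2):
    """Return every node: original chunks plus all summary levels."""
    all_nodes = list(chunks)          # Level 0: the original chunks
    level = chunks
    for _ in range(max_levels):
        if len(level) < 2:            # nothing left to cluster
            break
        clusters = cluster_fn(level)                     # list of lists of texts
        summaries = [summarize_fn(c) for c in clusters]  # one summary per cluster
        all_nodes.extend(summaries)   # summaries are retrievable too
        level = summaries             # recurse: cluster the summaries
    return all_nodes

# Demo with trivial stand-ins: pair up neighbours, "summarize" by joining.
def pair_up(texts):
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]

def join_summary(cluster):
    return "SUMMARY(" + " | ".join(cluster) + ")"

nodes = build_raptor_tree(["c1", "c2", "c3", "c4"], pair_up, join_summary)
# nodes holds 4 leaves + 2 level-1 summaries + 1 level-2 root summary
```

Indexing `all_nodes` rather than only the leaves is exactly what lets a single FAISS search answer both fine-grained and big-picture questions.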
| Situation | BM25 Wins | Dense Wins |
|---|---|---|
| "What is quicksort?" | ✅ Exact keyword match | |
| "What is the fast sorting method?" | | ✅ Semantic similarity |
| "O(n log n) algorithm" | ✅ Exact notation | |
| "Efficient ordering technique" | | ✅ Conceptual match |
Reciprocal Rank Fusion mathematically combines both rankings:
RRF_score(doc) = Σ 1 / (60 + rank_in_list_i)
The constant 60 prevents high-ranking documents from dominating completely, giving fair weight to documents that appear in both lists.
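In code, the fusion step is only a few lines. This is a minimal sketch of the standard RRF formula (the project's retrieval module may differ in details such as top-N sizes):

```python
# Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank) per
# document; documents appearing in several lists accumulate score.
def rrf_fuse(rankings, k=60, top_n=8):
    """rankings: list of ranked doc-id lists (e.g. BM25 top-20, FAISS top-20)."""
    scores = {}
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]

bm25 = ["d3", "d1", "d7"]        # keyword ranking
dense = ["d1", "d9", "d3"]       # semantic ranking
print(rrf_fuse([bm25, dense]))   # → ['d1', 'd3', 'd9', 'd7']
```

Note how `d1` wins overall: it is not first in either list, but it is the only document ranked highly by both branches.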
| Property | Value |
|---|---|
| Model | sentence-transformers/all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Download size | ~90 MB |
| Inference | CPU (no GPU required) |
| Similarity metric | Cosine similarity |
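A note on why cosine similarity is cheap here: once every embedding is L2-normalized, cosine similarity reduces to a plain dot product, which is what an inner-product vector index computes. The tiny pure-Python illustration below assumes this normalize-then-dot convention (real SBERT embeddings are 384-dimensional, not 3-dimensional):

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

doc = normalize([1.0, 2.0, 2.0])
query = normalize([1.0, 2.0, 2.0])    # identical direction
other = normalize([2.0, -1.0, 0.0])   # orthogonal to doc

print(dot(doc, query))   # close to 1.0: same direction
print(dot(doc, other))   # close to 0.0: unrelated
```

Normalizing at index time means each query costs only one matrix-vector product at search time, with no per-document division.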
The system prompt enforces structured, educational responses:
- Concept first — explain the intuition before the details
- Code examples — all algorithm questions get working code in the correct language
- Complexity analysis — time and space complexity always mentioned for algorithms
- Markdown structure — headers, bullets, numbered steps for organization
- Minimum depth — 4-10 sentences minimum, never one-line answers
- No raw context dumping — always synthesizes, never pastes
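A system prompt enforcing those six rules might look like the sketch below. This is a hedged illustration, not the project's actual prompt wording; `SYSTEM_PROMPT` and `build_messages` are hypothetical names.

```python
# Illustrative system prompt encoding the answer-style rules listed above.
SYSTEM_PROMPT = """You are an expert tutor answering from textbook excerpts.
Rules:
1. Explain the core concept and intuition BEFORE the details.
2. For algorithm questions, include working code in the correct language.
3. Always state time and space complexity for algorithms.
4. Structure answers with markdown headers, bullets, and numbered steps.
5. Write at least 4-10 sentences; never give one-line answers.
6. Synthesize from the provided context; never paste raw excerpts.
"""

def build_messages(context, question):
    # Standard chat-completion message layout: system turn + user turn.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

Keeping the rules in a single constant makes the answer style easy to tune without touching the retrieval or fallback logic.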
Switch anytime via ☰ Menu → Color Theme:
| Theme | Accent | Background | Best For |
|---|---|---|---|
| 🌑 Dark | #58a6ff Blue | #0d1117 Almost Black | Extended sessions, night use |
| ☀️ Light | #0969da Blue | #f6f8fa Off-White | Bright environments, printing |
| 🔮 Indigo | #7c6af7 Purple | #0f0e17 Deep Purple | Creative / focused work |
| 🌊 Teal | #2dd4bf Teal | #0a1628 Deep Blue | High contrast reading |
All theme variables cascade through the entire UI — topbar, sidebar, cards, chat bubbles, code blocks, progress bars, and scrollbars all update immediately.
# Activate venv first, then:
python test_pipeline.py

What each test validates:
| # | Test | Checks |
|---|---|---|
| ① | PDF Extraction | PyMuPDF reads text from a sample PDF correctly |
| ② | Text Chunking | NLTK splits text into correct chunk sizes with sentence boundaries |
| ③ | SBERT Embeddings | Model loads and produces correct 384-dim vectors |
| ④ | RAPTOR Indexing | GMM clustering runs and produces multi-level node tree |
| ⑤ | FAISS Store | Vectors stored and nearest-neighbor search returns correct results |
| ⑥ | BM25 Index | Whoosh indexes text and returns keyword matches |
Expected output:
✅ Test 1 — PDF Extraction: PASSED
✅ Test 2 — Text Chunking: PASSED
✅ Test 3 — SBERT Embeddings: PASSED
✅ Test 4 — RAPTOR Indexing: PASSED
✅ Test 5 — FAISS Store: PASSED
✅ Test 6 — BM25 Index: PASSED
────────────────────────────────────
✅ All 6 tests passed! Pipeline is working correctly.
❌ ModuleNotFoundError on startup
The virtual environment may not be activated or dependencies weren't installed:
# Activate venv
source venv/bin/activate # Linux / macOS
venv\Scripts\activate.bat # Windows
# Reinstall
pip install -r requirements.txt

❌ NLTK punkt_tab / wordnet errors
python -c "
import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
"❌ Groq / OpenAI API errors
- Confirm your key is in `.env` with no extra spaces or quotes
- Verify the key is valid at the provider's dashboard
- The app automatically falls back to Wikipedia if any key is invalid — answers still work
❌ Slow first run (5+ minutes)
This is expected on the very first run only. SBERT downloads all-MiniLM-L6-v2 (~90 MB). Once cached in ~/.cache/huggingface/, all subsequent runs start in seconds.
❌ PDF shows "No readable text found"
Your PDF is a scanned image rather than digital text. Install Tesseract:
# Ubuntu
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows: https://github.com/UB-Mannheim/tesseract/wiki

Then re-upload the PDF — OCR activates automatically.
❌ Out of memory during RAPTOR indexing
For very large PDFs (500+ pages), RAPTOR may use significant RAM. The pipeline processes in batches automatically. If you hit limits, the app continues with fewer RAPTOR levels and still works correctly.
❌ Chat input doesn't respond
Hard-refresh your browser (Ctrl+Shift+R or Cmd+Shift+R) to clear any cached Streamlit state.
❌ PDF viewer not showing the document
Chrome has the best support for the base64 PDF data URI used by the viewer. If using Firefox or Edge and the viewer appears blank, try switching to Chrome.
# Core
numpy >= 1.24.0
pandas >= 2.0.0
# PDF Extraction
pymupdf >= 1.23.0
pytesseract >= 0.3.10
Pillow >= 10.0.0
# NLP & Text Processing
nltk >= 3.8.1
# Embeddings & ML
sentence-transformers >= 2.2.2
transformers >= 4.33.0
torch >= 2.0.0
scikit-learn >= 1.3.0
faiss-cpu >= 1.7.4
# BM25 Keyword Index
whoosh >= 2.7.4
# LLM Providers
groq # Free Llama 3.3 70B
openai >= 1.0.0 # GPT-3.5 Turbo fallback
# Wikipedia Fallback
wikipedia-api >= 0.6.0
# Environment & Config
python-dotenv >= 1.0.0
# Web UI
streamlit >= 1.28.0
Install all at once:
pip install -r requirements.txt

Contributions, bug reports, and feature requests are very welcome!
# 1. Fork the repository on GitHub
# 2. Clone your fork
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
# 3. Create a feature branch
git checkout -b feature/your-feature-name
# 4. Make your changes and test
python test_pipeline.py
# 5. Commit and push
git commit -m "feat: describe your change clearly"
git push origin feature/your-feature-name
# 6. Open a Pull Request on GitHub

Contribution ideas:
- 📁 Support for DOCX, EPUB, HTML, Markdown files
- 🤖 Additional LLM providers (Anthropic Claude, Google Gemini, local Ollama)
- 💾 Persistent vector store (save index to disk, reload on next session)
- 📤 Export chat history as PDF or Markdown
- 🗂️ Multi-PDF sessions (query across multiple books simultaneously)
- 🐳 Docker containerization for easier deployment
- 🌍 Multi-language PDF support