A next-generation PDF Q&A platform combining RAPTOR hierarchical indexing, Hybrid BM25 + Dense retrieval, and Llama 3.3 70B via Groq to turn any textbook into an interactive AI tutor — with source citations, code examples, and complexity analysis in every answer.
- 🚀 What is ResolvaBot?
- ✨ Features
- 🏗️ Architecture
- 📁 Project Structure
- ⚙️ System Requirements
- 🛠️ Installation
- 🔑 API Keys Setup
- ▶️ Running the App
- 🖥️ Using the Interface
- 🔬 Technical Deep Dive
- 🎨 UI Themes
- 🧪 Testing the Pipeline
- 🔧 Troubleshooting
- 📦 Full Dependencies List
- 🤝 Contributing
ResolvaBot LLM is a desktop-first, locally-hosted AI study assistant that transforms any PDF textbook into a fully searchable, intelligent Q&A knowledge base. It combines cutting-edge retrieval techniques with large language models to deliver structured, cited, code-inclusive answers — just like having a senior tutor available 24/7.
Who is it built for?
| Audience | Use Case |
|---|---|
| 🎓 Students | Ask questions directly from CS / algorithms / science textbooks |
| 👨🏫 Educators | Surface relevant passages for lecture preparation instantly |
| 🔬 Researchers | Query technical PDFs, papers, and manuals with precision |
| 💻 Developers | Understand codebases by uploading technical documentation |
| Feature | Details |
|---|---|
| RAPTOR Hierarchical Index | Recursive GMM clustering + LLM summarization — builds a tree-structured index for multi-granularity retrieval (leaf → branch → root) |
| Hybrid BM25 + Dense Search | Keyword matching (Whoosh BM25) + SBERT semantic vectors (FAISS cosine) working together |
| Reciprocal Rank Fusion (RRF) | Merges and re-ranks results from both retrieval branches for maximum combined relevance |
| WordNet Query Expansion | Automatically expands queries with synonyms to improve recall on paraphrased content |
| Multi-Model LLM Auto-Fallback | Primary model plus five fallback tiers: Groq Llama 3.3 70B → Llama 3.1 70B → Llama 3 8B → OpenAI GPT-3.5 → Wikipedia → Raw context |
| Wikipedia Live Fallback | When PDF context is insufficient, fetches real-time Wikipedia articles to answer |
| Source Passage Attribution | Every answer shows which passages were retrieved and their exact RRF relevance scores |
| Markdown + Code Answers | All LLM responses render full markdown — headers, code blocks with syntax highlighting, tables, bold |
| Feature | Details |
|---|---|
| PyMuPDF Extraction | Fast, accurate text extraction from any digital PDF |
| Tesseract OCR Fallback | Automatically uses OCR for scanned / image-based PDFs when detected |
| NLTK Sentence Chunking | Smart chunking that preserves sentence boundaries (~100 tokens per chunk) |
| SBERT Embeddings | all-MiniLM-L6-v2 — 384-dimensional semantic embeddings, CPU-fast |
| FAISS Vector Store | In-memory approximate nearest-neighbor search — no Docker, no server required |
| Real-Time 5-Step Progress | Live pipeline dashboard: Extract → Chunk → RAPTOR → FAISS → BM25 |
| Feature | Details |
|---|---|
| 3-Page Navigation | Upload & Index → Chat → Sources & PDF Preview |
| Collapsible Sidebar | Hamburger ☰ menu with navigation, theme switcher, document info, action buttons |
| 4 Color Themes | Dark, Light, Indigo, Teal — switches instantly with no reload |
| ChatGPT-Style Chat Interface | Bot on left with markdown rendering, user on right with gradient bubbles |
| Full-Height PDF Viewer | Native browser PDF rendering with zoom, scroll, search toolbar |
| No Full-Page Scroll | Fixed viewport layout — chat and sources scroll independently |
| Desktop-First Design | Centered 860px max-width container — optimized for widescreen monitors |
| Professional Topbar | Fixed navigation bar with LLM status badge and indexed status indicator |
| Document Summary Card | Pages, chunks, RAPTOR nodes, and file size shown after indexing |
┌─────────────────────────────────────────────────────────────────┐
│ USER UPLOADS PDF │
└──────────────────────────┬──────────────────────────────────────┘
│
┌──────────────▼──────────────┐
│ 1. TEXT EXTRACTION │
│ PyMuPDF → OCR fallback │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ 2. SENTENCE CHUNKING │
│ NLTK → ~100 tokens/chunk │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ 3. SBERT EMBEDDINGS │
│ all-MiniLM-L6-v2 384-dim │
└──────────────┬──────────────┘
│
┌─────────────────▼─────────────────┐
│ 4. RAPTOR INDEXING │
│ Level 0: Original chunks │
│ ↓ GMM soft clustering │
│ Level 1: LLM cluster summaries │
│ ↓ Re-embed + cluster again │
│ Level 2: High-level abstractions │
│ All nodes stored for retrieval │
└──────────┬──────────────┬──────────┘
│ │
┌──────────▼───┐ ┌───────▼──────┐
│ FAISS Index │ │ BM25 Index │
│ Dense vecs │ │ Whoosh KW │
└──────────────┘ └──────────────┘
USER ASKS A QUESTION
│
┌──────────────▼──────────────┐
│ WordNet Query Expansion │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────────────┐
│ HYBRID RETRIEVAL │
│ BM25 Top-20 + FAISS Top-20 │
│ RRF Re-Ranking │
│ → Final Top-8 │
└──────────────┬──────────────────────┘
│
┌──────────────▼──────────────────────────────┐
│ LLM ANSWER GENERATION │
│ 1st → Groq Llama 3.3 70B (free, fast) │
│ 2nd → Groq Llama 3.1 70B │
│ 3rd → Groq Llama 3 8B │
│ 4th → OpenAI GPT-3.5 Turbo │
│ 5th → Wikipedia API (live articles) │
│ 6th → Raw context excerpt (last resort) │
└──────────────────────────────────────────────┘
ResolvaBot-LLM/
│
├── 📄 app.py # Main Streamlit app — UI, routing, 3 page views
├── 📋 requirements.txt # All Python dependencies with pinned versions
├── 🔧 setup.sh # Automated one-command setup (Linux / macOS)
├── 🔧 setup.bat # Automated one-command setup (Windows)
├── 🧪 test_pipeline.py # Full pipeline validation — no API key needed
├── 🔐 .env.example # API key template — copy to .env
├── 📖 README.md # This documentation file
│
└── 📂 src/ # Core backend modules
├── 🔍 extraction.py # PyMuPDF PDF text extraction + Tesseract OCR fallback
├── ✂️ chunking.py # NLTK sentence-aware text chunking (~100 token chunks)
├── 🧠 embeddings.py # SBERT all-MiniLM-L6-v2 embedding generation
├── 🌲 raptor_index.py # RAPTOR: GMM clustering + recursive LLM summarization
├── ⚡ vector_store.py # FAISS in-memory vector database (no Docker required)
├── 🔎 retrieval.py # Hybrid BM25 + Dense + RRF re-ranking + WordNet expansion
└── 💬 question_answering.py # Multi-model LLM with 5-level auto-fallback chain
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 4 GB | 8 GB+ |
| Disk | 3 GB (models + deps) | 5 GB+ |
| CPU | Any modern dual-core | Quad-core+ |
| GPU | Not required | Optional (speeds embeddings) |
| Requirement | Minimum | Recommended |
|---|---|---|
| OS | Windows 10, macOS 11, Ubuntu 20.04 | Any modern 64-bit OS |
| Python | 3.9 | 3.11+ |
| Browser | Chrome 90+, Firefox 88+ | Chrome (best PDF viewer support) |
| Internet | Required for 1st run (SBERT download) | Broadband |
Linux / macOS:
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
bash setup.sh

Windows:
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
setup.bat

The script automatically creates a virtualenv, installs all dependencies, downloads NLTK data, and creates your `.env` file.
# 1. Clone repository
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
# 2. Create virtual environment
python -m venv venv
# 3. Activate
source venv/bin/activate # Linux / macOS
# venv\Scripts\activate.bat # Windows
# 4. Install all dependencies
pip install -r requirements.txt
# 5. Download NLTK required data
python -c "
import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
"
# 6. Set up environment file
cp .env.example .env
# Then edit .env and add your API keys

# Ubuntu / Debian
sudo apt-get install tesseract-ocr
# macOS (Homebrew)
brew install tesseract
# Windows
# Download installer: https://github.com/UB-Mannheim/tesseract/wiki

Tesseract is optional — only needed if your PDFs are scanned images rather than digital text.
Groq provides free access to Llama 3.3 70B — blazing fast, highest quality, no billing required.
- Sign up at console.groq.com — free, no credit card
- Navigate to API Keys → Create API Key
- Add to your `.env` file:

GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

- Sign up at platform.openai.com/api-keys
- Add to `.env`:

OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The app runs fully without any API key. Answers come from Wikipedia when the PDF context is insufficient. Ideal for testing the pipeline or offline use.
Your Question
│
▼
Groq Llama 3.3 70B ──(fail)──► Groq Llama 3.1 70B
│
(fail)───▼
Groq Llama 3 8B
│
(fail)───▼
OpenAI GPT-3.5 Turbo
│
(fail)───▼
Wikipedia Live Article
│
(fail)───▼
Raw context excerpt
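The chain above can be sketched in a few lines of Python. This is an illustrative simplification, not the project's actual code: `answer_with_fallback` and the stub providers are hypothetical names standing in for the real Groq / OpenAI / Wikipedia calls.

```python
# Try each answer source in order; return the first non-empty result.
def answer_with_fallback(question, context, providers):
    """providers: ordered list of (tier_name, callable) pairs."""
    for name, ask in providers:
        try:
            answer = ask(question, context)
            if answer:                 # treat a non-empty string as success
                return name, answer
        except Exception:              # bad key, rate limit, network error...
            continue                   # fall through to the next tier
    return "raw-context", context[:1500]   # last resort: raw excerpt

# Demo with stubs standing in for real API calls:
def groq_stub(question, context):
    raise RuntimeError("rate limited")     # simulate a failed API call

def wiki_stub(question, context):
    return "Binary search halves the search range on every comparison."

tier, answer = answer_with_fallback(
    "What is binary search?", "…context…",
    [("groq-llama-3.3-70b", groq_stub), ("wikipedia", wiki_stub)],
)
print(tier)   # → wikipedia
```

Because every tier is just a callable, adding a new provider (or reordering the chain) is a one-line change to the `providers` list.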
# Activate virtual environment first
source venv/bin/activate # Linux / macOS
# venv\Scripts\activate.bat # Windows
# Optional: Verify the pipeline works without API key
python test_pipeline.py
# ✅ All 6 tests passed! Pipeline is working correctly.
# Launch the app
streamlit run app.py

Open http://localhost:8501 in your browser.
💡 First run note: SBERT downloads all-MiniLM-L6-v2 (~90 MB) on first startup. This takes 1-2 minutes, once. All subsequent runs start in seconds.
① Drag & drop any PDF (up to 200 MB) or click to browse
② Watch the real-time 5-step pipeline execute:
📝 Extract Text → PyMuPDF reads all pages
✂️ Chunk Text → NLTK splits into ~100-token chunks
🌲 RAPTOR Index → GMM clusters + LLM summarizes (2 levels)
⚡ FAISS Store → Dense embeddings indexed
📑 BM25 Index → Keyword inverted index built
③ Document Summary card appears showing:
• Total pages extracted
• Number of text chunks
• RAPTOR tree nodes (chunks + summaries)
• File size
• Embedding model used
④ Click "💬 Start Chatting →" to open Chat page
① Type any question about the textbook content
② ResolvaBot retrieves the 8 most relevant passages using hybrid search
③ Llama 3.3 70B synthesizes a structured answer including:
• Concept/intuition explained first
• Step-by-step details
• Working code examples with syntax highlighting
• Time & space complexity analysis
• Source attribution (which model answered, from textbook or Wikipedia)
④ Click "📋 N source passages" expander to inspect retrieved passages
• Each passage shows: page number, relevance score, full text
• Code passages render with syntax highlighting
① Full-height PDF viewer with native browser controls:
• Page navigation
• Zoom in/out
• Text search within PDF
• Download option
② The PDF viewer renders the currently loaded document
③ Switch back to Chat to ask more questions
NAVIGATION
📂 Upload & Index → Go to upload page
💬 Chat → Go to chat page
🔍 Sources & PDF → View PDF
COLOR THEME
🔵 ⬛ 🟣 🟢 → 4 theme swatches (Dark, Light, Indigo, Teal)
DOCUMENT (when PDF is loaded)
📄 filename.pdf
296 pages · 1260 chunks · 1.0 MB
📂 Unload PDF → Clear and start fresh
🗑 Clear Chat → Reset conversation history
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) solves a fundamental limitation of standard RAG:
Standard RAG: Only retrieves from leaf-level chunks. Fails at "big picture" or cross-chapter questions.
RAPTOR: Builds a tree. Retrieves from all levels. Answers both specific facts AND thematic questions.
Original Chunks (Level 0)
"Binary search runs in O(log n)..."
"A sorted array is required for binary search..."
"The mid-point is calculated as (lo + hi) / 2..."
│
▼ GMM Soft Clustering
Cluster A: 3 chunks about binary search
Cluster B: 4 chunks about sorting algorithms
│
▼ LLM Summarize each cluster
Summary A: "Binary search is a divide-and-conquer algorithm..."
Summary B: "Sorting algorithms order elements using comparisons..."
│
▼ Re-embed, cluster again (Level 2)
Root summary: "Chapter 3 covers search and sorting algorithms..."
The final FAISS index contains all nodes — original chunks + every level of summaries. This means a question like "What is this chapter about?" can be answered from root nodes, while "What is the time complexity of binary search?" is answered from leaf nodes.
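The build loop above can be sketched as follows. This is an illustrative simplification: the real pipeline uses GMM soft clustering and LLM-generated summaries, while here both are pluggable stubs (`cluster_fn`, `summarize_fn` are hypothetical names) so only the recursive tree-building structure is shown.

```python
# Build a RAPTOR-style tree: cluster, summarize, then recurse on the
# summaries. Every node (leaves + all summary levels) is kept for retrieval.
def build_raptor_tree(chunks, cluster_fn, summarize_fn, max_levels=2):
    """Return every node: original chunks plus all summary levels."""
    all_nodes = list(chunks)          # Level 0: the original chunks
    level = chunks
    for _ in range(max_levels):
        if len(level) < 2:            # nothing left to cluster
            break
        clusters = cluster_fn(level)                     # list of lists of texts
        summaries = [summarize_fn(c) for c in clusters]  # one summary per cluster
        all_nodes.extend(summaries)   # summaries are retrievable too
        level = summaries             # recurse: cluster the summaries
    return all_nodes

# Demo with trivial stand-ins: pair up neighbours, "summarize" by joining.
def pair_up(texts):
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]

def join_summary(cluster):
    return "SUMMARY(" + " | ".join(cluster) + ")"

nodes = build_raptor_tree(["c1", "c2", "c3", "c4"], pair_up, join_summary)
# nodes holds 4 leaves + 2 level-1 summaries + 1 level-2 root summary
```

Indexing `all_nodes` rather than only the leaves is exactly what lets a single FAISS search answer both fine-grained and big-picture questions.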
| Situation | BM25 Wins | Dense Wins |
|---|---|---|
| "What is quicksort?" | ✅ Exact keyword match | |
| "What is the fast sorting method?" | | ✅ Semantic similarity |
| "O(n log n) algorithm" | ✅ Exact notation | |
| "Efficient ordering technique" | | ✅ Conceptual match |
Reciprocal Rank Fusion mathematically combines both rankings:
RRF_score(doc) = Σ 1 / (60 + rank_in_list_i)
The constant 60 prevents high-ranking documents from dominating completely, giving fair weight to documents that appear in both lists.
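In code, the fusion step is only a few lines. This is a minimal sketch of the standard RRF formula (the project's retrieval module may differ in details such as top-N sizes):

```python
# Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank) per
# document; documents appearing in several lists accumulate score.
def rrf_fuse(rankings, k=60, top_n=8):
    """rankings: list of ranked doc-id lists (e.g. BM25 top-20, FAISS top-20)."""
    scores = {}
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]

bm25 = ["d3", "d1", "d7"]        # keyword ranking
dense = ["d1", "d9", "d3"]       # semantic ranking
print(rrf_fuse([bm25, dense]))   # → ['d1', 'd3', 'd9', 'd7']
```

Note how `d1` wins overall: it is not first in either list, but it is the only document ranked highly by both branches.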
| Property | Value |
|---|---|
| Model | sentence-transformers/all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Download size | ~90 MB |
| Inference | CPU (no GPU required) |
| Similarity metric | Cosine similarity |
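A note on why cosine similarity is cheap here: once every embedding is L2-normalized, cosine similarity reduces to a plain dot product, which is what an inner-product vector index computes. The tiny pure-Python illustration below assumes this normalize-then-dot convention (real SBERT embeddings are 384-dimensional, not 3-dimensional):

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

doc = normalize([1.0, 2.0, 2.0])
query = normalize([1.0, 2.0, 2.0])    # identical direction
other = normalize([2.0, -1.0, 0.0])   # orthogonal to doc

print(dot(doc, query))   # close to 1.0: same direction
print(dot(doc, other))   # close to 0.0: unrelated
```

Normalizing at index time means each query costs only one matrix-vector product at search time, with no per-document division.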
The system prompt enforces structured, educational responses:
- Concept first — explain the intuition before the details
- Code examples — all algorithm questions get working code in the correct language
- Complexity analysis — time and space complexity always mentioned for algorithms
- Markdown structure — headers, bullets, numbered steps for organization
- Minimum depth — 4-10 sentences minimum, never one-line answers
- No raw context dumping — always synthesizes, never pastes
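A system prompt enforcing those six rules might look like the sketch below. This is a hedged illustration, not the project's actual prompt wording; `SYSTEM_PROMPT` and `build_messages` are hypothetical names.

```python
# Illustrative system prompt encoding the answer-style rules listed above.
SYSTEM_PROMPT = """You are an expert tutor answering from textbook excerpts.
Rules:
1. Explain the core concept and intuition BEFORE the details.
2. For algorithm questions, include working code in the correct language.
3. Always state time and space complexity for algorithms.
4. Structure answers with markdown headers, bullets, and numbered steps.
5. Write at least 4-10 sentences; never give one-line answers.
6. Synthesize from the provided context; never paste raw excerpts.
"""

def build_messages(context, question):
    # Standard chat-completion message layout: system turn + user turn.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

Keeping the rules in a single constant makes the answer style easy to tune without touching the retrieval or fallback logic.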
Switch anytime via ☰ Menu → Color Theme:
| Theme | Accent | Background | Best For |
|---|---|---|---|
| 🌑 Dark | #58a6ff Blue | #0d1117 Almost Black | Extended sessions, night use |
| ☀️ Light | #0969da Blue | #f6f8fa Off-White | Bright environments, printing |
| 🔮 Indigo | #7c6af7 Purple | #0f0e17 Deep Purple | Creative / focused work |
| 🌊 Teal | #2dd4bf Teal | #0a1628 Deep Blue | High contrast reading |
All theme variables cascade through the entire UI — topbar, sidebar, cards, chat bubbles, code blocks, progress bars, and scrollbars all update immediately.
# Activate venv first, then:
python test_pipeline.py

What each test validates:
| # | Test | Checks |
|---|---|---|
| ① | PDF Extraction | PyMuPDF reads text from a sample PDF correctly |
| ② | Text Chunking | NLTK splits text into correct chunk sizes with sentence boundaries |
| ③ | SBERT Embeddings | Model loads and produces correct 384-dim vectors |
| ④ | RAPTOR Indexing | GMM clustering runs and produces multi-level node tree |
| ⑤ | FAISS Store | Vectors stored and nearest-neighbor search returns correct results |
| ⑥ | BM25 Index | Whoosh indexes text and returns keyword matches |
Expected output:
✅ Test 1 — PDF Extraction: PASSED
✅ Test 2 — Text Chunking: PASSED
✅ Test 3 — SBERT Embeddings: PASSED
✅ Test 4 — RAPTOR Indexing: PASSED
✅ Test 5 — FAISS Store: PASSED
✅ Test 6 — BM25 Index: PASSED
────────────────────────────────────
✅ All 6 tests passed! Pipeline is working correctly.
❌ ModuleNotFoundError on startup
The virtual environment may not be activated or dependencies weren't installed:
# Activate venv
source venv/bin/activate # Linux / macOS
venv\Scripts\activate.bat # Windows
# Reinstall
pip install -r requirements.txt

❌ NLTK punkt_tab / wordnet errors
python -c "
import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
"❌ Groq / OpenAI API errors
- Confirm your key is in `.env` with no extra spaces or quotes
- Verify the key is valid at the provider's dashboard
- The app automatically falls back to Wikipedia if any key is invalid — answers still work
❌ Slow first run (5+ minutes)
This is expected on the very first run only. SBERT downloads all-MiniLM-L6-v2 (~90 MB). Once cached in ~/.cache/huggingface/, all subsequent runs start in seconds.
❌ PDF shows "No readable text found"
Your PDF is a scanned image rather than digital text. Install Tesseract:
# Ubuntu
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows: https://github.com/UB-Mannheim/tesseract/wiki

Then re-upload the PDF — OCR activates automatically.
❌ Out of memory during RAPTOR indexing
For very large PDFs (500+ pages), RAPTOR may use significant RAM. The pipeline processes in batches automatically. If you hit limits, the app continues with fewer RAPTOR levels and still works correctly.
❌ Chat input doesn't respond
Hard-refresh your browser (Ctrl+Shift+R or Cmd+Shift+R) to clear any cached Streamlit state.
❌ PDF viewer not showing the document
Chrome has the best support for the base64 PDF data URI used by the viewer. If using Firefox or Edge and the viewer appears blank, try switching to Chrome.
# Core
numpy >= 1.24.0
pandas >= 2.0.0
# PDF Extraction
pymupdf >= 1.23.0
pytesseract >= 0.3.10
Pillow >= 10.0.0
# NLP & Text Processing
nltk >= 3.8.1
# Embeddings & ML
sentence-transformers >= 2.2.2
transformers >= 4.33.0
torch >= 2.0.0
scikit-learn >= 1.3.0
faiss-cpu >= 1.7.4
# BM25 Keyword Index
whoosh >= 2.7.4
# LLM Providers
groq # Free Llama 3.3 70B
openai >= 1.0.0 # GPT-3.5 Turbo fallback
# Wikipedia Fallback
wikipedia-api >= 0.6.0
# Environment & Config
python-dotenv >= 1.0.0
# Web UI
streamlit >= 1.28.0
Install all at once:
pip install -r requirements.txt

Contributions, bug reports, and feature requests are very welcome!
# 1. Fork the repository on GitHub
# 2. Clone your fork
git clone https://github.com/Anamicca23/ResolvaBot-LLM.git
cd ResolvaBot-LLM
# 3. Create a feature branch
git checkout -b feature/your-feature-name
# 4. Make your changes and test
python test_pipeline.py
# 5. Commit and push
git commit -m "feat: describe your change clearly"
git push origin feature/your-feature-name
# 6. Open a Pull Request on GitHub

Contribution ideas:
- 📁 Support for DOCX, EPUB, HTML, Markdown files
- 🤖 Additional LLM providers (Anthropic Claude, Google Gemini, local Ollama)
- 💾 Persistent vector store (save index to disk, reload on next session)
- 📤 Export chat history as PDF or Markdown
- 🗂️ Multi-PDF sessions (query across multiple books simultaneously)
- 🐳 Docker containerization for easier deployment
- 🌍 Multi-language PDF support