Skip to content

Alibaba-NLP/VRAG

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

44 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

✨Moving Towards Next-Generation RAG via Multi-Modal Agentic Reinforcement Learning

A Multi-Turn Multi-Modal Agent Training Framework




πŸ“‘ Table of Contents


πŸ”₯ News

  • ⏳ The project is still under ongoing development, and the training code of VimRAG will be available after being reviewed by the company.
  • πŸŽ‰ We have released the report of the VimRAG.
  • πŸŽ‰ We have released the retriever based on FAISS, enabling retrieval with GVE embedding and Qwen3-VL-Embedding.
  • πŸŽ‰ We have released the demo of VRAG-RL, allowing you to customize your own VRAG.
  • πŸŽ‰ Our framework integrates SOTA visual embedding models, enabling you to create your own retriever.

πŸš€ Overview & New Feature

  • We introduce VimRAG, a novel framework tailored for multimodal Retrieval-Augmented Reasoning across text, images, and videos.
  • We propose the Multimodal Memory Graph and Graph-Guided Policy Optimization, modeling the reasoning process as a dynamic directed acyclic graph, and by pruning memory nodes associated with redundant actions, GGPO enables fine-grained credit assignment and accelerates training convergence.
  • We introduce VRAG, a purely visual RAG agent that enables VLMs to progressively gather information from a coarse-grained to a fine-grained perspective.
  • We have released the training framework of VRAG-RL, a novel multi-turn and multimodal training framework with strong extensibility, capable of supporting training with various tools.

βš™οΈ Dependencies

# Create environment
conda create -n vrag python=3.10
# Clone project
git clone https://github.com/alibaba-nlp/VRAG.git
cd VRAG
# Install dependencies for demo and retriever
pip install -r requirements.txt

πŸš€ Quick Start

Please refer to run_demo.sh to quickly start the demo. Below is a step-by-step guide to help you run the demo on our example data.

One-Command Launch

# VimRAG (API-based, recommended for quick start)
export DASHSCOPE_API_KEY=your_api_key
./run_demo.sh vimrag
# VRAG (Local model, requires A100 80G)
./run_demo.sh vrag
# Search engine only
./run_demo.sh search

πŸ” Build Your Own Retriever

Step 1: Prepare Corpus

Images: Place image files directly in the corpus directory:

cp /path/to/your/images/*.jpg search_engine/corpus/image/

PDFs: Convert PDF documents to images:

mkdir -p search_engine/corpus/pdf
cp /path/to/your/documents/*.pdf search_engine/corpus/pdf/
python search_engine/corpus/pdf2images.py

Videos: Split long videos into smaller chunks:

./search_engine/corpus/splitVideo.sh -i /path/to/videos -o search_engine/corpus/video -d 60

Step 2: Build Index

Supported Embedding Models:

Model Dimension Notes
Alibaba-NLP/GVE-3B 2048 Qwen2.5-VL-based embedding
Alibaba-NLP/GVE-7B 3584 Higher quality, more VRAM
Qwen/Qwen3-VL-Embedding-2B 2048 Qwen3-VL-based embedding
Qwen/Qwen3-VL-Embedding-8B 4096 Higher quality, more VRAM

Build the Index:

from search_engine.search_engine import SearchEngine

# Initialize with your chosen embedding model
engine = SearchEngine("/path/to/Qwen3-VL-Embedding-2B")

# Build index from your corpus
engine.build_index(
    input_dir="search_engine/corpus/image",
    index_output_path="search_engine/corpus/image_index",
    corpus_output_path="search_engine/corpus/image_index",
    bs=16  # Adjust based on memory
)

Note: The index is automatically saved periodically. If interrupted, re-running will resume from the last checkpoint.

Step 3: Start Search Engine API

Edit search_engine/search_engine_api.py to configure paths:

model_path = "/path/to/your/embedding/model"
corpus_path = ["search_engine/corpus/image_index"]

Launch the API server:

python search_engine/search_engine_api.py

Test the endpoint:

curl -X POST http://localhost:8001/search \
    -H "Content-Type: application/json" \
    -d '{"queries": ["your search query"], "top_k": 3}'

πŸ’» Run Demo

VimRAG Demo (Recommended)

VimRAG uses Qwen3.5-Plus via DashScope API β€” no local GPU required for model inference.

Features:

  • Real-time DAG visualization of reasoning process
  • Multimodal memory graph
  • Extended thinking mode
  • Streaming output

Launch:

export DASHSCOPE_API_KEY=your_api_key
./run_demo.sh vimrag

Manual Launch:

# Terminal 1: Start search engine
python search_engine/search_engine_api.py

# Terminal 2: Launch Streamlit demo
streamlit run demo/vimrag_app.py

Configuration Options:

Option Default Description
API Base URL https://dashscope.aliyuncs.com/compatible-mode/v1 DashScope Qwen API endpoint
Search Engine URL http://localhost:8001/search Local search engine endpoint
Model qwen3.5-plus Model to use (supports multimodal reasoning)
Max Steps 20 Maximum reasoning iterations
Search Top-K 3 Number of results per search

Programmatic Usage:

import os
from demo.vimrag_agent import VimRAG

agent = VimRAG(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    search_url="http://localhost:8001/search",
    model_name="qwen3.5-plus",
    api_key=os.environ.get("DASHSCOPE_API_KEY"),
    enable_thinking=True
)

for event in agent.run({"query": "Your question here"}):
    if event["event"] == "answer":
        print(event["content"])

VRAG Demo (Local Model)

GIF1 GIF2 GIF3
VRAG.mp4
VRAG.mp4

VRAG uses a locally deployed Qwen2.5-VL-7B model via vLLM.

Launch:

./run_demo.sh vrag

Manual Launch:

# Terminal 1: Start search engine (port 8001)
python search_engine/search_engine_api.py

# Terminal 2: Start vLLM server (port 8002)
vllm serve autumncc/Qwen2.5-VL-7B-VRAG \
    --port 8002 \
    --host 0.0.0.0 \
    --limit-mm-per-prompt image=10 \
    --served-model-name Qwen/Qwen2.5-VL-7B-Instruct

# Terminal 3: Launch Streamlit demo
streamlit run demo/app.py

Programmatic Usage:

from demo.vrag_agent import VRAG

vrag = VRAG(
    base_url="http://0.0.0.0:8002/v1",
    search_url="http://0.0.0.0:8001/search",
    generator=False,
    api_key="EMPTY"
)

answer = vrag.run("Your question here")

βš™οΈ Model Training

VRAG-RL

Training code for VRAG-RL is available in the VRAG-RL/ directory.

Installation:

cd VRAG-RL
pip install -e .
pip install -r requirements_train.txt

Start Training:

./train_grpo_qwen2_5_vl_7b.sh

See VRAG-RL/README.md for detailed training instructions.

VimRAG

Note: VimRAG training code (Qwen3-VL) will be released after company review.


πŸ“ Project Structure

VRAG/
β”œβ”€β”€ demo/                      # Demo applications
β”‚   β”œβ”€β”€ app.py                 # VRAG Streamlit demo
β”‚   β”œβ”€β”€ vimrag_app.py          # VimRAG Streamlit demo
β”‚   β”œβ”€β”€ vrag_agent.py          # VRAG agent implementation
β”‚   └── vimrag_agent.py        # VimRAG agent implementation
β”œβ”€β”€ search_engine/             # Multimodal retrieval engine
β”‚   β”œβ”€β”€ models/                # Embedding models (GVE, Qwen3-VL)
β”‚   β”œβ”€β”€ corpus/                # Corpus and index storage
β”‚   β”œβ”€β”€ search_engine.py       # Core search engine
β”‚   └── search_engine_api.py   # FastAPI server
β”œβ”€β”€ VRAG-RL/                   # RL training framework
β”‚   β”œβ”€β”€ verl/                  # Training infrastructure
β”‚   β”œβ”€β”€ vrag_agent/            # Agent for training
β”‚   └── train_grpo_qwen2_5_vl_7b.sh
β”œβ”€β”€ assets/                    # Images and demos
β”œβ”€β”€ requirements.txt           # Dependencies
└── run_demo.sh                # Launch script

πŸ™ Acknowledgments

This work is implemented based on ViDoRAG, LLaMA-Factory, Search-R1, and verl. We greatly appreciate their valuable contributions to the community.


πŸ“ Citation

@article{wang2025vrag,
  title={Vrag-rl: Empower vision-perception-based rag for visually rich information understanding via iterative reasoning with reinforcement learning},
  author={Wang, Qiuchen and Ding, Ruixue and Zeng, Yu and Chen, Zehui and Chen, Lin and Wang, Shihang and Xie, Pengjun and Huang, Fei and Zhao, Feng},
  journal={arXiv preprint arXiv:2505.22019},
  year={2025}
}

@article{wang2026vimrag,
  title={VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph},
  author={Wang, Qiuchen and Wang, Shihang and Zeng, Yu and Zhang, Qiang and Zhang, Fanrui and Guo, Zhuoning and Zhang, Bosi and Huang, Wenxuan and Chen, Lin and Chen, Zehui and others},
  journal={arXiv preprint arXiv:2602.12735},
  year={2026}
}

πŸ”— Related Projects

Explore our additional research on Visual Retrieval-Augmented Generation:

ViDoRAG GitHub stars

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents. A novel RAG framework that utilizes a multi-agent, actor-critic paradigm for iterative reasoning, enhancing the noise robustness of generation models.


Made with ❀️ by Alibaba NLP

About

Multimodal Retrieval-augmented Generation Framework Built by Tongyi Lab, Alibaba Group.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages