NanoDeploy: LLM Inference with Prefill-Decode Disaggregation and Wide Expert Parallelism

📦 Components

Component	Language	Description	Key Features
DLSlime	C++	RDMA communication	Zero-copy KV cache migration, P2P mesh networking, GPUDirect RDMA
NanoCtrl	Rust	Control plane	Redis-backed service registry, health monitoring, engine discovery, Python client
NanoDeploy	Python/C++	LLM inference engine	Prefill/decode engines, KV cache management, continuous batching, Ray-based distributed workers
NanoDeployVL	Python	Vision-Language encoder	EP-separated ViT encoder, RDMA embedding transfer, Qwen3-VL support
NanoRoute	Rust	HTTP load balancer	OpenAI-compatible API, tool calls, routing strategies, engine discovery

🏗️ Architecture

graph TB
    Client[Client Layer<br/>HTTP Requests / OpenAI SDK]
    Route[NanoRoute<br/>Rust/HTTP<br/>Load Balancer]
    VL[NanoDeployVL<br/>Vision Encoder]
    Prefill[Prefill Engine<br/>Python/C++]
    Decode[Decode Engine<br/>Python/C++]
    Ctrl[NanoCtrl<br/>Redis<br/>Service Registry]

    Client -->|HTTP| Route
    Route -->|ZMQ| VL
    Route -->|ZMQ| Prefill
    Route -->|ZMQ| Decode
    VL -->|RDMA<br/>Embeddings| Prefill
    Prefill -->|RDMA<br/>KV Migration| Decode
    VL -->|Register/Heartbeat| Ctrl
    Prefill -->|Register/Heartbeat| Ctrl
    Decode -->|Register/Heartbeat| Ctrl
    Route -->|Engine Discovery| Ctrl

🚀 Installation

The root pyproject.toml acts as a meta-package that lets you install any combination of Python components in a single command.

One-liner: install everything

pip install ".[all]"

Install individual components

pip install ".[dlslime]"      # DLSlime transfer engine only
pip install ".[nanoctrl]"     # NanoCtrl lifecycle client only
pip install ".[nanodeploy]"   # NanoDeploy inference engine only
pip install ".[nanodeployvl]" # NanoDeployVL vision-language encoder only

For developers

# Build NanoDeploy C++ extensions in-place
cd NanoDeploy && pip install -e . && cd ..

# Build NanoRoute (Rust)
cd NanoRoute && cargo build --release && cd ..

# Build NanoCtrl (Rust) + install Python client
cd NanoCtrl && cargo build --release && pip install -e . && cd ..

Quick Start: PD Disaggregated Serving

Prefill-Decode disaggregation splits prompt processing (prefill) and token generation (decode) across separate GPU nodes connected via RDMA.

Prerequisites

2 nodes with NVIDIA GPUs (SM90+ for FP8), RDMA-capable NICs
Redis, Ray cluster, Rust toolchain

1. Start Ray

# Node 0 (head)
ray start --head --port=7078 --dashboard-host=0.0.0.0

# Node 1 (multi-node only)
ray start --address <node0-ip>:7078

Offline mode

Batch generation without HTTP serving.

Single node (no NanoCtrl needed)

python NanoDeploy/examples/non_disagg.py \
    --model /models/Qwen3-235B-A22B \
    --ray_address <node0-ip>:7078 \
    --master_address <node0-ip>:6006 \
    --attention_dp 8 --ffn_ep 8 \
    --kvcache_block_size 256 \
    --prompt "1+1=?" --max_tokens 128

PD disaggregated (2 nodes)

2. Start Redis + NanoCtrl

redis-server --bind 0.0.0.0 --port 6379
cd NanoCtrl && cargo run --release    # edit config.toml to set redis_url

3. Launch engines

python NanoDeploy/examples/disagg.py \
    --model /models/Qwen3-235B-A22B \
    --ray_address <node0-ip>:7078 \
    --nanoctrl_address <node0-ip>:3000 \
    --attention_dp 8 --ffn_ep 8 \
    --prefill.master_address <node0-ip>:6006 \
    --decode.master_address <node1-ip>:6006

Online mode

ZMQ engine servers with OpenAI-compatible HTTP API via NanoRoute.

2. Start Redis + NanoCtrl

redis-server --bind 0.0.0.0 --port 6379
cd NanoCtrl && cargo run --release    # edit config.toml to set redis_url

3. Start NanoRoute

cd NanoRoute && cargo run --release    # edit config.toml to set nanoctrl_address

4. Launch engines

# Terminal 1 — Decode engine
python NanoDeploy/nanodeploy/server/engine_server.py \
    --model /models/Qwen3-235B-A22B \
    --mode decode \
    --ray_address <node0-ip>:7078 \
    --nanoctrl_address <node0-ip>:3000 \
    --nanoctrl_scope nanoctrl-0 \
    --master_address <node1-ip>:6006 \
    --host <node0-ip> --port 6001 \
    --attention_dp 8 --ffn_ep 8 \
    --kvcache_block_size 64 \
    --max_num_batched_tokens 16384 --max_model_len 16384

# Terminal 2 — Prefill engine
python NanoDeploy/nanodeploy/server/engine_server.py \
    --model /models/Qwen3-235B-A22B \
    --mode prefill \
    --ray_address <node0-ip>:7078 \
    --nanoctrl_address <node0-ip>:3000 \
    --nanoctrl_scope nanoctrl-0 \
    --master_address <node0-ip>:6006 \
    --host <node0-ip> --port 6002 \
    --attention_dp 8 --ffn_ep 8 \
    --kvcache_block_size 64 \
    --max_num_batched_tokens 16384 --max_model_len 16384

5. Send requests

curl http://<node0-ip>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3-235B-A22B", "messages": [{"role": "user", "content": "Hello"}]}'

📄 License

See individual component license.

📞 Support

Issues: GitHub Issues
Documentation: Check component READMEs

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
DLSlime		DLSlime
NanoCommon		NanoCommon
NanoCtrl		NanoCtrl
NanoDeploy		NanoDeploy
NanoDeployVL		NanoDeployVL
NanoRoute		NanoRoute
NanoSequence		NanoSequence
bench		bench
third_party		third_party
.clang-format		.clang-format
.clangd		.clangd
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NanoDeploy: LLM Inference with Prefill-Decode Disaggregation and Wide Expert Parallelism

📦 Components

🏗️ Architecture

🚀 Installation

One-liner: install everything

Install individual components

For developers

Quick Start: PD Disaggregated Serving

Prerequisites

1. Start Ray

Offline mode

Single node (no NanoCtrl needed)

PD disaggregated (2 nodes)

2. Start Redis + NanoCtrl

3. Launch engines

Online mode

2. Start Redis + NanoCtrl

3. Start NanoRoute

4. Launch engines

5. Send requests

📄 License

📞 Support

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

NanoDeploy: LLM Inference with Prefill-Decode Disaggregation and Wide Expert Parallelism

📦 Components

🏗️ Architecture

🚀 Installation

One-liner: install everything

Install individual components

For developers

Quick Start: PD Disaggregated Serving

Prerequisites

1. Start Ray

Offline mode

Single node (no NanoCtrl needed)

PD disaggregated (2 nodes)

2. Start Redis + NanoCtrl

3. Launch engines

Online mode

2. Start Redis + NanoCtrl

3. Start NanoRoute

4. Launch engines

5. Send requests

📄 License

📞 Support

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages