Skip to content

DeepLink-org/NanoDeploy

Repository files navigation

NanoDeploy: LLM Inference with Prefill-Decode Disaggregation and Wide Expert Parallelism

πŸ“¦ Components

Component Language Description Key Features
DLSlime C++ RDMA communication Zero-copy KV cache migration, P2P mesh networking, GPUDirect RDMA
NanoCtrl Rust Control plane Redis-backed service registry, health monitoring, engine discovery, Python client
NanoDeploy Python/C++ LLM inference engine Prefill/decode engines, KV cache management, continuous batching, Ray-based distributed workers
NanoDeployVL Python Vision-Language encoder EP-separated ViT encoder, RDMA embedding transfer, Qwen3-VL support
NanoRoute Rust HTTP load balancer OpenAI-compatible API, tool calls, routing strategies, engine discovery

πŸ—οΈ Architecture

graph TB
    Client[Client Layer<br/>HTTP Requests / OpenAI SDK]
    Route[NanoRoute<br/>Rust/HTTP<br/>Load Balancer]
    VL[NanoDeployVL<br/>Vision Encoder]
    Prefill[Prefill Engine<br/>Python/C++]
    Decode[Decode Engine<br/>Python/C++]
    Ctrl[NanoCtrl<br/>Redis<br/>Service Registry]

    Client -->|HTTP| Route
    Route -->|ZMQ| VL
    Route -->|ZMQ| Prefill
    Route -->|ZMQ| Decode
    VL -->|RDMA<br/>Embeddings| Prefill
    Prefill -->|RDMA<br/>KV Migration| Decode
    VL -->|Register/Heartbeat| Ctrl
    Prefill -->|Register/Heartbeat| Ctrl
    Decode -->|Register/Heartbeat| Ctrl
    Route -->|Engine Discovery| Ctrl
Loading

πŸš€ Installation

The root pyproject.toml acts as a meta-package that lets you install any combination of Python components in a single command.

One-liner: install everything

pip install ".[all]"

Install individual components

pip install ".[dlslime]"      # DLSlime transfer engine only
pip install ".[nanoctrl]"     # NanoCtrl lifecycle client only
pip install ".[nanodeploy]"   # NanoDeploy inference engine only
pip install ".[nanodeployvl]" # NanoDeployVL vision-language encoder only

For developers

# Build NanoDeploy C++ extensions in-place
cd NanoDeploy && pip install -e . && cd ..

# Build NanoRoute (Rust)
cd NanoRoute && cargo build --release && cd ..

# Build NanoCtrl (Rust) + install Python client
cd NanoCtrl && cargo build --release && pip install -e . && cd ..

Quick Start: PD Disaggregated Serving

Prefill-Decode disaggregation splits prompt processing (prefill) and token generation (decode) across separate GPU nodes connected via RDMA.

Prerequisites

  • 2 nodes with NVIDIA GPUs (SM90+ for FP8), RDMA-capable NICs
  • Redis, Ray cluster, Rust toolchain

1. Start Ray

# Node 0 (head)
ray start --head --port=7078 --dashboard-host=0.0.0.0

# Node 1 (multi-node only)
ray start --address <node0-ip>:7078

Offline mode

Batch generation without HTTP serving.

Single node (no NanoCtrl needed)

python NanoDeploy/examples/non_disagg.py \
    --model /models/Qwen3-235B-A22B \
    --ray_address <node0-ip>:7078 \
    --master_address <node0-ip>:6006 \
    --attention_dp 8 --ffn_ep 8 \
    --kvcache_block_size 256 \
    --prompt "1+1=?" --max_tokens 128

PD disaggregated (2 nodes)

2. Start Redis + NanoCtrl
redis-server --bind 0.0.0.0 --port 6379
cd NanoCtrl && cargo run --release    # edit config.toml to set redis_url
3. Launch engines
python NanoDeploy/examples/disagg.py \
    --model /models/Qwen3-235B-A22B \
    --ray_address <node0-ip>:7078 \
    --nanoctrl_address <node0-ip>:3000 \
    --attention_dp 8 --ffn_ep 8 \
    --prefill.master_address <node0-ip>:6006 \
    --decode.master_address <node1-ip>:6006

Online mode

ZMQ engine servers with OpenAI-compatible HTTP API via NanoRoute.

2. Start Redis + NanoCtrl
redis-server --bind 0.0.0.0 --port 6379
cd NanoCtrl && cargo run --release    # edit config.toml to set redis_url
3. Start NanoRoute
cd NanoRoute && cargo run --release    # edit config.toml to set nanoctrl_address
4. Launch engines
# Terminal 1 β€” Decode engine
python NanoDeploy/nanodeploy/server/engine_server.py \
    --model /models/Qwen3-235B-A22B \
    --mode decode \
    --ray_address <node0-ip>:7078 \
    --nanoctrl_address <node0-ip>:3000 \
    --nanoctrl_scope nanoctrl-0 \
    --master_address <node1-ip>:6006 \
    --host <node0-ip> --port 6001 \
    --attention_dp 8 --ffn_ep 8 \
    --kvcache_block_size 64 \
    --max_num_batched_tokens 16384 --max_model_len 16384

# Terminal 2 β€” Prefill engine
python NanoDeploy/nanodeploy/server/engine_server.py \
    --model /models/Qwen3-235B-A22B \
    --mode prefill \
    --ray_address <node0-ip>:7078 \
    --nanoctrl_address <node0-ip>:3000 \
    --nanoctrl_scope nanoctrl-0 \
    --master_address <node0-ip>:6006 \
    --host <node0-ip> --port 6002 \
    --attention_dp 8 --ffn_ep 8 \
    --kvcache_block_size 64 \
    --max_num_batched_tokens 16384 --max_model_len 16384
5. Send requests
curl http://<node0-ip>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3-235B-A22B", "messages": [{"role": "user", "content": "Hello"}]}'

πŸ“„ License

See individual component license.

πŸ“ž Support

About

LLM Inference with Prefill-Decode Disaggregation and Wide Expert Parallelism

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages