Add support for NVIDIA DGX Spark (GB10 / sm_121a, arm64)#1835
Open
boots-coder wants to merge 5 commits into THUDM:main
Conversation
boots-coder force-pushed from 37561d6 to 9447a85.
The help= kwarg was written as a single-element tuple (trailing comma), which makes argparse's format_help() raise AttributeError: 'tuple' object has no attribute 'strip' on any `python train.py --help` invocation. Remove the comma so the help text is a plain string.
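The failure is easy to reproduce in isolation. A minimal sketch (the --demo flag is a hypothetical stand-in for the real argument in slime/utils/arguments.py; the exact exception type varies by Python version):

```python
import argparse

buggy = argparse.ArgumentParser(prog="train.py")

# Buggy form: the trailing comma (not the parentheses) makes help= a
# 1-element tuple, so format_help() fails when argparse treats it as a str.
buggy.add_argument(
    "--demo",
    help=("some help text",),  # <-- trailing comma: tuple, not str
)

try:
    buggy.format_help()
except (AttributeError, TypeError) as exc:
    # AttributeError ('tuple' object has no attribute 'strip') on recent
    # Pythons; TypeError on older ones.
    print(type(exc).__name__)

# Fixed form: drop the comma and the parenthesized literal is a plain str.
fixed = argparse.ArgumentParser(prog="train.py")
fixed.add_argument("--demo", help=("some help text"))
assert "some help text" in fixed.format_help()
```

Nothing catches this at add_argument() time, which is why the bug only surfaces on a --help invocation.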
slime's published images are x86_64-only. The arm64 slimerl/sglang base
ships CUDA 12.9 whose ptxas lacks an sm_121a target, so Triton crashes on
first JIT. The ENABLE_CUDA_13=1 branch in the upstream Dockerfile is
aimed at GB200/GB300 (sm_100a) with an x86-only sgl-router wheel and does
not work on GB10.
This change adds a second Dockerfile targeting GB10 specifically:
- Rebased on nvcr.io/nvidia/vllm:26.03-py3 (arm64), pinned by digest.
Ships CUDA 13.2, PyTorch 2.11.0a0 with compute_120 PTX,
Triton 3.6.0, flash-attn 2.7.4.post1.
- sgl-kernel is rebuilt from source with a new CMake option
SGL_KERNEL_GB10_ONLY=ON (patch_sgl_kernel.py / sgl-kernel-arch.patch)
that restricts gencode to sm_120a + sm_121a. The stock 7-arch
emission OOM-kills cicc on 128 GB Spark hosts (each extra cutlass
FP8 gemm arch costs ~10-15 GB RAM per TU).
- TransformerEngine 2.10 is built with NVTE_CUDA_ARCHS=120f;121f.
The 'f' (family-specific) arch suffix is required by TE's ptx.cuh
static_assert and is only parsed by CMake >= 4.0, so cmake is
upgraded to 4.3.1 (NGC's PIP_CONSTRAINT must be cleared).
- Small CUDA 13 gaps that NGC does not ship are filled:
* cuda_profiler_api.h shim (symbols remain in libcudart.so)
* NVTX3 headers (copied from github.com/NVIDIA/NVTX)
* libcudnn_engines_precompiled.so.9 (from nvidia-cudnn-cu13 wheel)
- The x86-only zhuzilin/sgl-router wheel is swapped for upstream
  sglang-router==0.3.2 (has an arm64 wheel). slime's version-compare
  code still works; only the `'slime' in version` wandb branch differs.
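The arch restriction itself is a small, mechanical CMake rewrite. A hedged sketch of the kind of substitution patch_sgl_kernel.py performs; the variable name and stock arch list below are illustrative assumptions, not the actual sgl-kernel CMakeLists.txt contents:

```python
import re

# Illustrative stand-in for the stock multi-arch list (an assumption,
# not the real sgl-kernel CMakeLists.txt).
STOCK = 'set(SGL_KERNEL_CUDA_ARCHS "75;80;86;89;90a;100a;120a")'

GB10_ARCHS = "120a;121a"  # sm_120a + sm_121a only

def restrict_to_gb10(cmake_text: str) -> str:
    """Rewrite the arch list so nvcc emits gencode for GB10-class GPUs only.

    Cutting the emission from 7 arches to 2 is what keeps cicc's peak RAM
    (~10-15 GB per extra cutlass FP8 gemm arch per TU) inside a 128 GB host.
    """
    return re.sub(
        r'(set\(SGL_KERNEL_CUDA_ARCHS ")[^"]*("\))',
        rf'\g<1>{GB10_ARCHS}\g<2>',
        cmake_text,
    )

print(restrict_to_gb10(STOCK))
```

The shipped sgl-kernel-arch.patch is the unified-diff form of the same edit, gated behind the SGL_KERNEL_GB10_ONLY option so the default build is untouched.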
Also adds scripts/run-qwen2.5-0.5B-gb10-smoke.sh: a 1-GPU colocated
smoke that exercises the full rollout -> reward -> policy-update ->
weight-sync cycle. Validated end to end with Qwen2.5-0.5B on
dapo-math-17k; a single step completes in ~2m10s on GB10.
Captures the root causes and resolutions for every issue encountered while porting the full stack (sgl-kernel, TransformerEngine, apex, Megatron-LM, sglang, slime) to GB10 / sm_121a. Placed next to Dockerfile.gb10 so reviewers see the reference material when reading the Dockerfile.
boots-coder force-pushed from fd1e8ef to e98e60c.
…nc, math_verify, pyarrow, accelerate)

These packages were installed manually during the interactive GB10 build session but were missing from the Dockerfile, breaking clean reproduction.
- numpy<2: Megatron requires numpy 1.x (NGC ships 2.x)
- pylatexenc, math_verify, word2number: reward function runtime deps
- pyarrow: GSM8K parquet data preprocessing
- accelerate: HuggingFace transformers device_map support
Add NVIDIA DGX Spark (GB10 / sm_121a) support
Summary
This PR adds support for NVIDIA DGX Spark (Project Digits, GB10 chip — Grace CPU + consumer Blackwell GPU at sm_121a, aarch64, 128 GB unified memory) to slime's docker build pipeline. GB10 is an explicit gap in the current support matrix:
- slime's published images (slimerl/slime:*) are x86_64-only.
- The arm64 slimerl/sglang:v0.5.9 ships CUDA 12.9, whose ptxas has no sm_121a target, so Triton JIT crashes on the first kernel.
- The ENABLE_CUDA_13=1 branch in the upstream Dockerfile is aimed at GB200/GB300 (sm_100a) with an x86-only sgl-router wheel.

This PR rebases a second Dockerfile (docker/Dockerfile.gb10) on nvcr.io/nvidia/vllm:26.03-py3 (arm64), which ships CUDA 13.2, PyTorch 2.11.0a0 with compute_120 PTX, Triton 3.6.0, and flash-attn 2.7.4.post1 — the minimal viable baseline for GB10. Fifteen small blockers are resolved in the Dockerfile and supporting patches.
End-to-end validation: Qwen2.5-0.5B + GRPO + dapo-math-17k, 1 GB10 GPU,
colocated actor + rollout, one full rollout → reward → policy-update cycle
completed in 2m10s (step 0 metrics logged, weight sync to SGLang succeeded,
checkpoint saved).
What's in the patch
New files
- docker/Dockerfile.gb10 — digest-pinned, arm64-native build on NGC vllm base
- docker/patch/gb10/patch_sgl_kernel.py — adds SGL_KERNEL_GB10_ONLY CMake option
- docker/patch/gb10/sgl-kernel-arch.patch — unified diff of the same
- docker/patch/gb10/cuda_profiler_api.h — 20-line shim for a header dropped in CUDA 13
- scripts/run-qwen2.5-0.5B-gb10-smoke.sh — minimal 1-GPU smoke script
- NOTES_GB10.md — walkthrough of the 15 blockers with root causes

Existing file edits
- slime/utils/arguments.py — remove trailing comma that turned an argparse help= string into a single-element tuple (causes 'tuple' object has no attribute 'strip' during --help; affects all platforms, not just GB10)

The 15 blockers (quick reference)
- ptxas fatal: sm_121a
- libnvrtc.so.12 missing from sgl_kernel wheel
- SGL_KERNEL_GB10_ONLY CMake option
- CUDNN::cudnn_engines_precompiled not found → nvidia-cudnn-cu13==9.20.0.48 pypi wheel, symlink
- nvtx3/nvToolsExt.h missing
- NVTE_CUDA_ARCHS="120f;121f" (f suffix) → cmake==4.3.1, override PIP_CONSTRAINT
- cuda_profiler_api.h missing in CUDA 13
- 'tuple' object has no attribute 'strip'
- sglang_router x86_64-only → sglang-router==0.3.2 (arm64 wheel)
- antlr4==4.13.2 breaks omegaconf → antlr4-python3-runtime==4.9.3
- megatron.training not found post pip install -e → PYTHONPATH=/root/src/Megatron-LM
- libz3.so → apt install libz3-dev
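Several of the blockers above reduce to simple version gates that can be checked before committing to a long build. A hypothetical preflight sketch (the function names are mine; the thresholds come from the list above):

```python
def _major_minor(version: str) -> tuple:
    """Parse 'X.Y[.Z...]' into (X, Y) for coarse comparisons."""
    parts = version.split(".")
    return (int(parts[0]), int(parts[1]))

def numpy_ok(installed: str) -> bool:
    # Megatron requires numpy 1.x; NGC ships 2.x, hence the numpy<2 pin.
    return _major_minor(installed) < (2, 0)

def cmake_ok(installed: str) -> bool:
    # TE's 'f' arch suffix (NVTE_CUDA_ARCHS="120f;121f") is only parsed
    # by CMake >= 4.0, hence the cmake==4.3.1 upgrade.
    return _major_minor(installed) >= (4, 0)

print(numpy_ok("1.26.4"), numpy_ok("2.1.0"))  # numpy<2 pin check
print(cmake_ok("4.3.1"), cmake_ok("3.28.0"))  # CMake >= 4.0 gate check
```

Failing either gate early is much cheaper than discovering the mismatch hours into the TransformerEngine or Megatron build.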
Run inside the image:
Scope this PR does NOT cover
Reproducibility
Base image is pinned by digest: