feat: MI300X disaggregated inference with Broadcom IBGDA (#982) #998
JordanNanos wants to merge 3 commits into main
Conversation
Port MI325X disagg+BNXT changes to the MI300X cluster (3 working nodes: mi300x-amds_0/2/3). MI300X is gfx942, identical to MI325X, using the same Vultr/CPE Broadcom Thor 2 bnxt_re NICs.

Changes:
- Add mi300x-disagg runner group (amds_0/2/3) to runners.yaml
- Add dsr1-fp8-mi300x-sglang-disagg and -disagg-mtp benchmark configs
- Add dsr1-fp8-mi300x-sglang-mtp single-node MTP benchmark config
- Add chi-mi300x-* NIC detection (bnxt_re0-7) and QoS (TC=104, SL=3)
- Add dsr1_fp8_mi300x_sglang-disagg.sh multi-node benchmark script
- Add MTP support to single-node dsr1_fp8_mi300x.sh
- Port launch_mi300x-amds.sh to a full multi-node launcher with MODEL_YAML_KEY
- Add docker/mi300x.Dockerfile + build script (registry API retag)
- Add scripts/manual-test-mi300x.sh
- Update perf-changelog.yaml

Image: ghcr.io/jordannanos/sgl-mi300x-mori:v0.5.9-bnxt
Thanks for the contribution! For vLLM and SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please open a PR there first before we merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
runners/launch_mi300x-amds.sh
Outdated
```shell
JOB_ID=$(bash "benchmarks/${BENCHMARK_SUBDIR}/${SCRIPT_NAME}")

LOG_FILE="$BENCHMARK_LOGS_DIR/slurm_job-${JOB_ID}.out"

sleep 10

while ! ls "$LOG_FILE" &>/dev/null; do
    if ! squeue -u "$USER" --noheader --format='%i' | grep -q "$JOB_ID"; then
        echo "ERROR: Job $JOB_ID failed before creating log file"
        scontrol show job "$JOB_ID"
        exit 1
    fi
    sleep 5
done

set +x

(
    while squeue -u $USER --noheader --format='%i' | grep -q "$JOB_ID"; do
```
🔴 The multi-node path in runners/launch_mi300x-amds.sh captures JOB_ID from the benchmark script (line 90) without checking that the value is non-empty; if the inner script fails silently, JOB_ID is empty, LOG_FILE becomes slurm_job-.out (never created), and the wait loop spins forever. Compounding this, both grep -q "$JOB_ID" calls (inside the wait-for-log loop and the polling subshell) use unanchored substring matching, so an empty JOB_ID matches every squeue line, and a valid short ID like 1234 can spuriously match a sibling job 12345, causing the polling loop to keep running after the real job finishes. Fix: add if [[ -z "$JOB_ID" ]]; then echo "ERROR: benchmark script produced no job ID"; exit 1; fi immediately after line 90, and change both grep calls to grep -qx "$JOB_ID" (exact line match).
Extended reasoning...
Bug 1 – Missing empty JOB_ID guard (primary)
At the point where JOB_ID=$(bash "benchmarks/${BENCHMARK_SUBDIR}/${SCRIPT_NAME}") executes in the multi-node path, the exit code of the inner script is silently ignored (no set -e is in effect at that point, and there is no explicit || guard). If dsr1_fp8_mi300x_sglang-disagg.sh calls submit.sh, which can exit 1 on failure, the outer script's command substitution captures an empty string into JOB_ID. At that point execution continues normally. The single-node else branch (added in the same PR) does include if [ -z "$JOB_ID" ]; then exit 1; fi after its own salloc capture, but the new multi-node if branch omits the equivalent check entirely.
How the infinite loop manifests
After the empty-JOB_ID assignment, `LOG_FILE="$BENCHMARK_LOGS_DIR/slurm_job-.out"` names a file that never exists. The outer `while ! ls "$LOG_FILE"` loop therefore runs indefinitely. Inside the loop, `squeue -u "$USER" --noheader --format='%i' | grep -q "$JOB_ID"` becomes `grep -q ""`, and an empty pattern matches every non-empty line of input, so as long as the user has any SLURM job running, the `! grep -q ""` condition is always false and the early-exit branch never fires. The loop spins at 5-second intervals until all other user jobs happen to drain from the queue, which could take hours or days in a shared CI cluster.
Bug 2 – Unanchored grep causes false positive job-ID matches
Both grep calls use grep -q "$JOB_ID" against squeue --format='%i' output, which emits one integer job ID per line. grep matches substrings, not full lines. SLURM job IDs are sequentially assigned integers; if the current job is ID 123456 and another user job 1234567 is also in the queue, grep -q "123456" matches the line 1234567. Consequence at line 97 (wait-for-log loop): the early-exit guard never fires even after job 123456 fails and leaves the queue, so the loop waits forever for a log file that was never written. Consequence at line 108 (polling subshell): the while squeue … | grep -q "$JOB_ID" loop keeps looping after the real job exits because the sibling job keeps matching; POLL_PID never exits, so tail -F --pid=$POLL_PID keeps tailing, and wait $POLL_PID blocks indefinitely.
Step-by-step proof of the empty-JOB_ID infinite loop

1. `submit.sh` inside `dsr1_fp8_mi300x_sglang-disagg.sh` exits non-zero (e.g., cluster unreachable).
2. The benchmark script echoes nothing to stdout and exits 1.
3. `JOB_ID=$(bash "benchmarks/multi_node/…sh")` — JOB_ID = `""`.
4. No guard checks JOB_ID; execution continues.
5. `LOG_FILE="$BENCHMARK_LOGS_DIR/slurm_job-.out"` — nonexistent path.
6. Outer loop: `while ! ls "slurm_job-.out"` — always true.
7. Inner check: `squeue -u "$USER" --noheader --format='%i' | grep -q ""` — an empty pattern matches every non-empty line; as long as the cluster has any user job, `! grep -q ""` is false, so `exit 1` is never reached.
8. The loop spins at 5 s intervals indefinitely. The CI run hangs.
Fix
After the JOB_ID assignment: if [[ -z "$JOB_ID" ]]; then echo "ERROR: benchmark script produced no job ID"; exit 1; fi. Replace both grep calls with grep -qx "$JOB_ID" (or grep -q "^${JOB_ID}$") to enforce exact-line matching and prevent substring collisions.
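The two fixes can be sketched together as follows. The `squeue` output below is simulated with a `printf` function so the snippet runs without a SLURM cluster, and the job IDs are made up for illustration:

```shell
#!/bin/bash
# 1) Guard against an empty JOB_ID immediately after the command substitution.
JOB_ID="1234"   # stand-in for: JOB_ID=$(bash "benchmarks/${BENCHMARK_SUBDIR}/${SCRIPT_NAME}")
if [[ -z "$JOB_ID" ]]; then
  echo "ERROR: benchmark script produced no job ID" >&2
  exit 1
fi

# 2) Exact-line matching: -x anchors the pattern to the whole line,
#    so job 1234 can no longer match a sibling job 12345.
#    Simulated `squeue -u "$USER" --noheader --format='%i'` output:
queue() { printf '12345\n1234\n999\n'; }

queue | grep -q  "$JOB_ID" && echo "substring match: found"   # matches 1234 AND 12345
queue | grep -qx "$JOB_ID" && echo "exact match: found"       # matches only the line '1234'

# With -x, a stale ID no longer matches once the real job leaves the queue:
queue_after_exit() { printf '12345\n999\n'; }
queue_after_exit | grep -qx "$JOB_ID" || echo "exact match: job gone, loop can exit"
```

Note that the empty-ID guard is still needed even with `-qx`, since an empty pattern with `-x` would match blank lines rather than hang on every job, but the log-file path would still be wrong.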
perf-changelog.yaml
Outdated
```yaml
- "Image (single-node): lmsysorg/sglang:v0.5.9-rocm700-mi30x"
- "Full pareto sweep: non-MTP and MTP configs across 2 curve points (no EP/DP), ISL 1k/1k and 8k/1k"
- "Runners: mi300x-amds_0, mi300x-amds_2, mi300x-amds_3 (3 working nodes)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982
```
🟡 The pr-link field in the new perf-changelog.yaml entry points to issues/982 (the tracking issue) instead of the implementing PR URL. All other non-placeholder entries in the file use the /pull/<number> format; the correct link for this PR should be /pull/998.
Extended reasoning...
What the bug is: The newly added perf-changelog.yaml entry (lines 1-15) sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982. The field is named pr-link and is intended to reference the pull request that implements the change, not the issue that motivated it.
The specific code path: Line 13 of perf-changelog.yaml reads `pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982`. Every other properly numbered entry in the file uses the `/pull/<number>` format (e.g., `/pull/919`, `/pull/857`, `/pull/95`). When a PR number is not yet known, the convention in this file is to use `/pull/XXX` as a placeholder (as in the entry for kimik2.5-int4-mi300x-vllm directly below).
Why existing code does not prevent it: There is no automated validation on pr-link values in the changelog; it is a free-text YAML field, so incorrect URLs are not caught at commit time.
Impact: Anyone reviewing the changelog will follow the pr-link to GitHub issue #982 (the tracking issue) rather than to this implementing PR (#998). This breaks the intended traceability — readers cannot navigate from the changelog entry to the code change itself.
How to fix it: Change line 13 from `pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982` to `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/998`.

Step-by-step proof:

1. The PR description states "Closes starter task: MVP port mi355 deepseek disagg recipe to mi300 #982", meaning #982 is the issue being closed, not the PR number.
2. This PR is "feat: MI300X disaggregated inference with Broadcom IBGDA (#982)" #998 (confirmed by `<pr number="998">` in the metadata).
3. Scanning perf-changelog.yaml, every entry with a known PR number uses the `/pull/<number>` format.
4. Entries with unknown PR numbers use `/pull/XXX` as a placeholder.
5. The new entry is the only one using an `/issues/` URL, making it factually incorrect metadata.
On the duplicate claim: One verifier noted this may overlap with bug_005. The underlying bug is real and independently confirmed by three verifiers; the issue stands on its own merits.
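For reference, the corrected tail of the entry would read as follows (a sketch; the list item is copied from the excerpt above, and only the pr-link line changes):

```yaml
- "Runners: mi300x-amds_0, mi300x-amds_2, mi300x-amds_3 (3 working nodes)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/998
```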
```shell
#SBATCH --time=00:15:00
#SBATCH --output=/home/j9s/build-mi300x-%j.log
#SBATCH --error=/home/j9s/build-mi300x-%j.log
#SBATCH --chdir=/tmp

set -euo pipefail
bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh
```
🟡 docker/build-sglang-bnxt-mi300x.sbatch hardcodes personal paths for user j9s (log dirs under /home/j9s/ and script path under /nfsdata/sa/j9s/), making the file unusable by any other team member. Fix by replacing the log paths with a shared or home-relative path (e.g. $HOME or %u in the #SBATCH directive) and the script path with a SLURM_SUBMIT_DIR-relative reference.
Extended reasoning...
Bug: Hardcoded personal user paths in build-sglang-bnxt-mi300x.sbatch
What the bug is and how it manifests
docker/build-sglang-bnxt-mi300x.sbatch is a new file introduced by this PR that contains three hardcoded absolute paths belonging to a single cluster user (j9s):
- Line 9: `#SBATCH --output=/home/j9s/build-mi300x-%j.log`
- Line 10: `#SBATCH --error=/home/j9s/build-mi300x-%j.log`
- Line 14: `bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh`
Any other team member who runs sbatch docker/build-sglang-bnxt-mi300x.sbatch will get an immediate failure: SLURM will reject the job (or fail silently at output redirection) because /home/j9s/ does not exist for them, and the bash invocation will fail with "No such file or directory" for the NFS path.
The specific code path that triggers it
The job is submitted with sbatch. SLURM processes the #SBATCH --output and --error directives before the script body runs, attempting to open the log files under /home/j9s/. Then line 14 executes bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh, which resolves to a personal NFS home directory on the cluster.
Why existing code doesn't prevent it
The #SBATCH directives are evaluated statically by SLURM at submission time; environment variables such as $HOME are not expanded in #SBATCH comment lines. However, SLURM does support the %u token (username expansion) in output/error paths, which would make them portable. The command on line 14 is a normal shell line and could use ${SLURM_SUBMIT_DIR} or a path relative to --chdir to reference the script in the repo, but --chdir=/tmp was set on line 11, making relative paths from the repo root unavailable without an explicit variable.
What the impact would be
Any team member other than j9s who tries to rebuild the MI300X image by submitting this batch job will get a broken job. The log output will be lost (or SLURM may refuse submission), and the actual build script will not be found, so the retag operation cannot complete. While this is a developer convenience script and the image is already published, the file is committed to a shared repository and gives the false impression that it is reusable.
How to fix it
1. Replace the log paths with a shared or user-relative path, e.g. `#SBATCH --output=%u-build-mi300x-%j.log` and `#SBATCH --error=%u-build-mi300x-%j.log`, or use a writable shared directory such as `/tmp/%u-build-mi300x-%j.log`.
2. Replace the hardcoded script path with one derived from the submission directory: `bash "${SLURM_SUBMIT_DIR}/build-sglang-bnxt-mi300x.sh"` (changing `--chdir` to something other than /tmp, or removing it and submitting from the repo root).
Step-by-step proof
1. User alice clones the repo and runs `sbatch docker/build-sglang-bnxt-mi300x.sbatch`.
2. SLURM reads `--output=/home/j9s/build-mi300x-%j.log`; /home/j9s does not exist for alice, so SLURM may fail to open the log file or silently discard output.
3. Even if the job starts, line 14 runs `bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh`; this NFS path is in j9s's personal directory and inaccessible to alice, yielding `bash: /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh: No such file or directory`, exit code 1.
4. The build fails immediately without executing any docker registry logic.
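Putting both fixes together, a portable version of the batch file could look like this. This is a sketch based on the review's suggestions, not the PR's actual file; the log location under /tmp and the `docker/` path segment are assumptions:

```shell
#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --output=/tmp/%u-build-mi300x-%j.log   # %u = submitting user, %j = job ID
#SBATCH --error=/tmp/%u-build-mi300x-%j.log
# (no --chdir, so the job is not pinned to /tmp)

set -euo pipefail
# SLURM_SUBMIT_DIR is the directory sbatch was invoked from; if submitted
# from the repo root as `sbatch docker/build-sglang-bnxt-mi300x.sbatch`,
# this resolves to the build script inside the repo checkout.
bash "${SLURM_SUBMIT_DIR}/docker/build-sglang-bnxt-mi300x.sh"
```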
- launch_mi300x-amds.sh: guard empty JOB_ID after the benchmark script; use grep -qx for exact-line matching to prevent substring false positives
- perf-changelog.yaml: fix pr-link from issues/982 to pull/998
- build-sglang-bnxt-mi300x.sbatch: replace hardcoded j9s paths with %u (SLURM username token) and SLURM_SUBMIT_DIR
MI300X uses a node-local HF cache (/home/gharunner/gharunners/hf-hub-cache/) that is not accessible from the controller where the launcher runs. This causes MODEL_NAME to fall back to the bare name (e.g. DeepSeek-R1-0528) instead of the full snapshot path. Add HF cache resolution inside job.slurm which runs on the compute node where the cache actually exists.
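The resolution step described above can be sketched as follows. This is a hypothetical version of the logic, not the PR's actual job.slurm code; the cache root follows the path quoted above, and the `models--<org>--<name>/snapshots/<revision>/` layout is the standard HF hub cache structure:

```shell
#!/bin/bash
# Resolve the full HF snapshot path on the compute node, where the
# node-local cache actually exists (hypothetical sketch).
HF_CACHE_ROOT="${HF_CACHE_ROOT:-/home/gharunner/gharunners/hf-hub-cache}"
MODEL_NAME="${MODEL_NAME:-DeepSeek-R1-0528}"

# HF hub caches store weights under models--<org>--<name>/snapshots/<revision>/;
# pick the first matching snapshot directory, if any.
snapshot_dir=$(ls -d "$HF_CACHE_ROOT"/models--*--"$MODEL_NAME"/snapshots/*/ 2>/dev/null | head -n1)

if [[ -n "$snapshot_dir" ]]; then
  MODEL_PATH="$snapshot_dir"      # full snapshot path found on this node
else
  MODEL_PATH="$MODEL_NAME"        # fall back to the bare model name
fi
echo "MODEL_PATH=$MODEL_PATH"
```

Running this inside job.slurm (rather than on the controller) is the key point: the glob only resolves where the node-local cache is mounted.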
Ports all MI325X disagg+BNXT fixes (from #985) to the MI300X cluster.
MI300X is gfx942 identical to MI325X, same Vultr/CPE cluster with Broadcom Thor 2 bnxt_re NICs.
3 working nodes: mi300x-amds_0, mi300x-amds_2, mi300x-amds_3
Sweep: single-node w/wo MTP + multinode w/wo MTP, DeepSeek-R1 FP8, ISL 1k/1k and 8k/1k
Image: `ghcr.io/jordannanos/sgl-mi300x-mori:v0.5.9-bnxt` (retag of MI325X image, same gfx942)

Closes #982