8 changes: 4 additions & 4 deletions .github/configs/amd-master.yaml
@@ -132,7 +132,7 @@ qwen3.5-bf16-mi355x-sglang:
- { tp: 8, conc-start: 4, conc-end: 64 }

qwen3.5-bf16-mi300x-sglang:
- image: lmsysorg/sglang:v0.5.9-rocm720-mi30x
+ image: lmsysorg/sglang:v0.5.10rc0-rocm720-mi30x
model: Qwen/Qwen3.5-397B-A17B
model-prefix: qwen3.5
runner: mi300x
@@ -150,7 +150,7 @@ qwen3.5-bf16-mi300x-sglang:
- { tp: 8, conc-start: 4, conc-end: 64 }

qwen3.5-bf16-mi325x-sglang:
- image: lmsysorg/sglang:v0.5.9-rocm720-mi30x
+ image: lmsysorg/sglang:v0.5.10rc0-rocm720-mi30x
model: Qwen/Qwen3.5-397B-A17B
model-prefix: qwen3.5
runner: mi325x
@@ -168,7 +168,7 @@ qwen3.5-bf16-mi325x-sglang:
- { tp: 8, conc-start: 4, conc-end: 64 }

qwen3.5-fp8-mi325x-sglang:
- image: lmsysorg/sglang:v0.5.9-rocm720-mi30x
+ image: lmsysorg/sglang:v0.5.10rc0-rocm720-mi30x
model: Qwen/Qwen3.5-397B-A17B-FP8
model-prefix: qwen3.5
runner: mi325x
@@ -204,7 +204,7 @@ qwen3.5-fp8-mi355x-sglang:
- { tp: 8, conc-start: 4, conc-end: 64 }

qwen3.5-fp8-mi300x-sglang:
- image: lmsysorg/sglang:v0.5.9-rocm720-mi30x
+ image: lmsysorg/sglang:v0.5.10rc0-rocm720-mi30x
model: Qwen/Qwen3.5-397B-A17B-FP8
model-prefix: qwen3.5
runner: mi300x
13 changes: 11 additions & 2 deletions benchmarks/single_node/qwen3.5_bf16_mi300x.sh
@@ -19,11 +19,14 @@ hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
Collaborator:

Why not ISL+OSL+200?

MAX_PREFILL_TOKENS=32768
Collaborator:

Why not a bigger prefill size?


EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
@@ -35,9 +38,15 @@ python3 -m sglang.launch_server \
--host=0.0.0.0 \
--port $PORT \
--tensor-parallel-size $TP \
Comment on lines 38 to 40
Contributor:

🔴 The four MI300X/MI325X Qwen3.5 benchmark scripts (qwen3.5_bf16_mi300x.sh, qwen3.5_bf16_mi325x.sh, qwen3.5_fp8_mi300x.sh, qwen3.5_fp8_mi325x.sh) now pass --ep-size $EP_SIZE to sglang but do not include EP_SIZE in their check_env_vars calls, meaning the server will launch with an empty --ep-size argument if EP_SIZE is unset. The MI355X scripts added by this same PR correctly validate EP_SIZE — the MI300X and MI325X scripts should do the same.

Extended reasoning...

What the bug is and how it manifests

This PR adds --ep-size $EP_SIZE to the sglang launch command in all six Qwen3.5 benchmark scripts (BF16 and FP8 variants across MI300X, MI325X, and MI355X). However, only the MI355X scripts were also updated to add EP_SIZE to their check_env_vars call. The four MI300X/MI325X scripts omit this validation entirely, so if EP_SIZE is not set in the environment at runtime, Bash will silently substitute an empty string, producing --ep-size (empty argument) in the sglang invocation.

The specific code path that triggers it

In qwen3.5_bf16_mi300x.sh (and the three analogous scripts), the check_env_vars block at lines 5–11 validates MODEL, TP, CONC, ISL, OSL, RANDOM_RANGE_RATIO, and RESULT_FILENAME — but not EP_SIZE. Then at line 42, the server is launched with --ep-size $EP_SIZE. If EP_SIZE is absent, the check passes silently and the server receives a malformed argument. Compare to qwen3.5_bf16_mi355x.sh, which correctly adds EP_SIZE at line 13 of its check_env_vars block.

Why existing code does not prevent it

The check_env_vars function is the established guard for required environment variables in this codebase. Omitting a variable from that call means any downstream use of that variable proceeds without validation. Bash does not error on unset variables unless set -u is active; these scripts do not appear to set that option, so the substitution silently produces an empty string.
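The silent substitution described above is easy to reproduce in a standalone sketch (the variable name mirrors the scripts, but the snippet does not depend on them):

```shell
#!/bin/sh
unset EP_SIZE

# Without `set -u`, an unset variable expands to the empty string, so the
# server would receive a bare "--ep-size" with no value (the | marks the end):
printf -- '--ep-size %s|\n' "$EP_SIZE"   # prints: --ep-size |

# With `set -u` (nounset), the same expansion aborts instead of passing
# garbage downstream:
( set -u; printf -- '--ep-size %s\n' "$EP_SIZE" ) 2>/dev/null \
    || echo "set -u caught the unset variable"
```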

Impact

Looking at amd-master.yaml, the search-space entries for qwen3.5-bf16-mi300x-sglang, qwen3.5-bf16-mi325x-sglang, qwen3.5-fp8-mi300x-sglang, and qwen3.5-fp8-mi325x-sglang only contain tp: keys — no ep: field — unlike other configs (e.g., minimaxm2.5-fp8-mi325x-vllm) that explicitly set ep:. If the benchmark runner does not populate EP_SIZE from the YAML config (because there is no ep: key), then every run of these four scripts will pass an empty --ep-size to sglang, likely causing a startup error or undefined behavior.

How to fix it

Add EP_SIZE to the check_env_vars call in all four affected scripts, matching what was done for the MI355X variants:

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE   # <-- add this line

Alternatively, also add an ep: field to the MI300X/MI325X search-space entries in amd-master.yaml so that EP_SIZE is always populated.
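Beyond extending check_env_vars, POSIX parameter expansion can enforce the same guard at the point of use. A sketch, where demo_launch is a hypothetical stand-in for the sglang launch line:

```shell
#!/bin/sh
# ${VAR:?message} aborts with an error if VAR is unset or empty, so an
# empty --ep-size can never reach the real launch command.
demo_launch() {
    echo "launching with --ep-size ${EP_SIZE:?EP_SIZE must be set}"
}

EP_SIZE=8
demo_launch                              # prints: launching with --ep-size 8

unset EP_SIZE
( demo_launch ) 2>/dev/null || echo "launch blocked: EP_SIZE unset"
```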

Step-by-step proof

  1. User triggers a benchmark for qwen3.5-bf16-mi300x-sglang via CI.
  2. The runner reads amd-master.yaml; the search-space entry has only { tp: 8, conc-start: 4, conc-end: 64 } — no ep key — so EP_SIZE is never exported to the environment.
  3. qwen3.5_bf16_mi300x.sh starts; check_env_vars passes because EP_SIZE is not in its list.
  4. The script reaches line 42: --ep-size $EP_SIZE expands to --ep-size (empty string).
  5. sglang either rejects the argument at startup or interprets an empty ep-size, causing a server failure — the benchmark never produces results.

--data-parallel-size 1 \
Collaborator:

Is the default value 1?

--trust-remote-code \
--mem-fraction-static 0.8 \
--disable-radix-cache $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
--tokenizer-worker-num 6 \
Collaborator:

Why not 8?

--enable-aiter-allreduce-fusion \
Collaborator:

What is this for?

--cuda-graph-max-bs $CONC \
--disable-radix-cache \
--max-prefill-tokens $MAX_PREFILL_TOKENS \
--scheduler-recv-interval 30 \
--mem-fraction-static 0.75 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
Collaborator:

Why not 0.85?


SERVER_PID=$!

13 changes: 11 additions & 2 deletions benchmarks/single_node/qwen3.5_bf16_mi325x.sh
@@ -19,11 +19,14 @@ hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
@@ -35,9 +38,15 @@ python3 -m sglang.launch_server \
--host=0.0.0.0 \
--port $PORT \
--tensor-parallel-size $TP \
--data-parallel-size 1 \
--trust-remote-code \
--mem-fraction-static 0.8 \
--disable-radix-cache $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
--tokenizer-worker-num 6 \
--enable-aiter-allreduce-fusion \
Comment on lines +43 to +44
Contributor:

Create an sglang cookbook PR to add these flags to the recipe doc, please?

--cuda-graph-max-bs $CONC \
--disable-radix-cache \
--max-prefill-tokens $MAX_PREFILL_TOKENS \
--scheduler-recv-interval 30 \
--mem-fraction-static 0.75 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

16 changes: 14 additions & 2 deletions benchmarks/single_node/qwen3.5_bf16_mi355x.sh
@@ -9,7 +9,8 @@ check_env_vars \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
@@ -19,11 +20,14 @@ hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
@@ -34,8 +38,16 @@ python3 -m sglang.launch_server \
--host=0.0.0.0 \
--port $PORT \
--tensor-parallel-size $TP \
--ep-size $EP_SIZE \
--data-parallel-size 1 \
--trust-remote-code \
--mem-fraction-static 0.8 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
--tokenizer-worker-num 6 \
--enable-aiter-allreduce-fusion \
--cuda-graph-max-bs $CONC \
--disable-radix-cache \
--max-prefill-tokens $MAX_PREFILL_TOKENS \
--scheduler-recv-interval 30 \
--mem-fraction-static 0.75 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

13 changes: 11 additions & 2 deletions benchmarks/single_node/qwen3.5_fp8_mi300x.sh
@@ -19,11 +19,14 @@ hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
@@ -36,9 +39,15 @@ python3 -m sglang.launch_server \
--host=0.0.0.0 \
--port $PORT \
--tensor-parallel-size $TP \
--data-parallel-size 1 \
--trust-remote-code \
--mem-fraction-static 0.8 \
--disable-radix-cache $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
--tokenizer-worker-num 6 \
--enable-aiter-allreduce-fusion \
--cuda-graph-max-bs $CONC \
--disable-radix-cache \
--max-prefill-tokens $MAX_PREFILL_TOKENS \
--scheduler-recv-interval 30 \
--mem-fraction-static 0.75 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

13 changes: 11 additions & 2 deletions benchmarks/single_node/qwen3.5_fp8_mi325x.sh
@@ -19,11 +19,14 @@ hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
@@ -36,9 +39,15 @@ python3 -m sglang.launch_server \
--host=0.0.0.0 \
--port $PORT \
--tensor-parallel-size $TP \
--data-parallel-size 1 \
--trust-remote-code \
--mem-fraction-static 0.8 \
--disable-radix-cache $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
--tokenizer-worker-num 6 \
--enable-aiter-allreduce-fusion \
--cuda-graph-max-bs $CONC \
--disable-radix-cache \
--max-prefill-tokens $MAX_PREFILL_TOKENS \
--scheduler-recv-interval 30 \
--mem-fraction-static 0.75 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

16 changes: 14 additions & 2 deletions benchmarks/single_node/qwen3.5_fp8_mi355x.sh
@@ -9,7 +9,8 @@ check_env_vars \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
@@ -19,11 +20,14 @@ hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
@@ -34,8 +38,16 @@ python3 -m sglang.launch_server \
--host=0.0.0.0 \
--port $PORT \
--tensor-parallel-size $TP \
--ep-size $EP_SIZE \
--data-parallel-size 1 \
--trust-remote-code \
--mem-fraction-static 0.8 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
--tokenizer-worker-num 6 \
--enable-aiter-allreduce-fusion \
--cuda-graph-max-bs $CONC \
--disable-radix-cache \
--max-prefill-tokens $MAX_PREFILL_TOKENS \
--scheduler-recv-interval 30 \
--mem-fraction-static 0.75 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -1220,3 +1220,13 @@
- "Uses nvidia/GLM-5-NVFP4 model with modelopt_fp4 quantization"
- "Image: lmsysorg/sglang:nightly-dev-cu13-20260328-a27651d5"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/973
- config-keys:
- qwen3.5-bf16-mi300x-sglang
- qwen3.5-bf16-mi325x-sglang
- qwen3.5-fp8-mi300x-sglang
- qwen3.5-fp8-mi325x-sglang
description:
- "Update cli args of Qwen3.5 FP8 and BF16 SGLang benchmarks for MI300X and MI325X to achieve better performance"
- "Use lmsysorg/sglang:v0.5.10rc0-rocm720-mi30x"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/986

Comment on lines +1229 to +1232
Contributor:

🟡 The new perf-changelog.yaml entry uses pr-link: .../pull/xx — a placeholder that was never updated with the actual PR number. Since this PR is #986, the link should be https://github.com/SemiAnalysisAI/InferenceX/pull/986 before merging.

Extended reasoning...

The changelog entry added by this PR ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/xx (lines 1229-1232). The PR number is definitively known at merge time — it is #986 — yet the placeholder was not filled in.

The refutation argues that using a placeholder is an "established and accepted pattern" because other entries in the file also use /pull/XXX. This is partially true: entries for kimik2.5-int4-mi300x-vllm, minimaxm2.5-fp8-h200-vllm, glm5-fp8-mi355x-sglang, qwen3.5-bf16-mi325x-sglang, and qwen3.5-fp8-mi325x-sglang all use the uppercase /pull/XXX convention as a temporary placeholder. However, the distinction is that those entries use the conventional uppercase XXX sentinel, while this entry uses lowercase xx, which is inconsistent even with the existing placeholder convention.

More importantly, those other entries used XXX because they were written before the PR number was assigned or while the PR was in draft. In this case, the PR has a concrete number (#986) and should have that number in the link before merging — that is the entire purpose of updating the link at merge time.

Step-by-step proof:

  1. Contributor creates PR [AMD][MI30X]Update Qwen3.5 perf #986 and writes the changelog entry.
  2. The pr-link field is set to /pull/xx as a placeholder (likely copy-pasted or typed without the number).
  3. At merge time, the contributor should have updated this to /pull/986.
  4. As-is, anyone looking at the changelog for this Qwen3.5 performance update cannot click through to the PR that introduced it.

The fix is trivial: change pull/xx to pull/986. This is a nit since it only affects documentation traceability, not runtime behavior.
