20 changes: 20 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -3122,6 +3122,26 @@ minimaxm2.5-fp8-b200-vllm:
        - { tp: 2, conc-start: 4, conc-end: 256 }
        - { tp: 4, conc-start: 4, conc-end: 256 }

minimaxm2.5-fp4-b200-vllm:
  image: vllm/vllm-openai:v0.19.0-cu130
  model: nvidia/MiniMax-M2.5-NVFP4
  model-prefix: minimaxm2.5
  runner: b200
  precision: fp4
  framework: vllm
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 2, conc-start: 4, conc-end: 512 }
        - { tp: 4, conc-start: 4, conc-end: 512 }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 2, conc-start: 4, conc-end: 256 }
        - { tp: 4, conc-start: 4, conc-end: 256 }
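The seq-len-configs entries above can be read as a sweep definition. A minimal sketch of how a harness might expand one entry into concrete (tp, concurrency) runs — assuming concurrency doubles from conc-start to conc-end, which is a hypothetical sweep strategy, not necessarily what this repo's harness does:

```python
# Hypothetical expansion of one seq-len-config entry into (tp, conc) run pairs,
# assuming the harness doubles concurrency from conc-start up to conc-end.
seq_len_config = {
    "isl": 1024,
    "osl": 1024,
    "search-space": [
        {"tp": 2, "conc-start": 4, "conc-end": 512},
        {"tp": 4, "conc-start": 4, "conc-end": 512},
    ],
}

runs = []
for space in seq_len_config["search-space"]:
    conc = space["conc-start"]
    while conc <= space["conc-end"]:
        runs.append((space["tp"], conc))
        conc *= 2  # geometric sweep: 4, 8, 16, ..., conc-end

print(runs)  # 8 concurrency points per tp value, 16 runs total
```

Under this doubling assumption, each tp value yields eight benchmark points (4 through 512), so the ISL 1024 / OSL 1024 entry defines sixteen runs.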

gptoss-fp4-h100-vllm:
  image: vllm/vllm-openai:v0.18.0
  model: openai/gpt-oss-120b
79 changes: 79 additions & 0 deletions benchmarks/single_node/minimaxm2.5_fp4_b200.sh
@@ -0,0 +1,79 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    EP_SIZE \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export VLLM_FLASHINFER_ALLREDUCE_BACKEND=mnnvl

if [ "$EP_SIZE" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
vllm serve $MODEL --port $PORT \
Contributor: add nvfp4 minimax to vllm recipes plz

    --tensor-parallel-size=$TP \
    $EP \
    --gpu-memory-utilization 0.95 \
    --max-model-len $MAX_MODEL_LEN \
    --block-size=32 \
    --kv-cache-dtype fp8 \
    --stream-interval 20 --no-enable-prefix-caching \
Contributor:

🔴 The script passes --stream-interval 20 to vllm serve, which is an SGLang-specific argument not recognized by vLLM; this will cause the server to exit immediately with an 'unrecognized arguments: --stream-interval' error, preventing the benchmark from running at all. Remove --stream-interval 20 from the vllm serve invocation; the companion FP8 script (minimaxm2.5_fp8_b200.sh) omits this flag and serves as the correct reference.

Extended reasoning:

Bug: --stream-interval is an SGLang-only argument passed to vllm serve

What the bug is and how it manifests

Line 50 of benchmarks/single_node/minimaxm2.5_fp4_b200.sh passes --stream-interval 20 as part of the vllm serve command. This flag is an SGLang server parameter (it controls how often SGLang flushes streamed tokens) and is not part of the vLLM CLI argument set. When vLLM receives an unrecognized argument, it takes argparse's standard error path, prints 'unrecognized arguments: --stream-interval', and exits with a non-zero status before the server ever starts.
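The failure mechanism can be demonstrated in isolation with a toy parser — a sketch only, mimicking argparse behavior rather than vLLM's actual (much larger) CLI:

```python
import argparse

# Minimal sketch of why vLLM rejects the flag: argparse exits with status 2
# on unrecognized arguments. "--stream-interval" is deliberately not registered
# here, standing in for its absence from vLLM's CLI.
parser = argparse.ArgumentParser(prog="vllm serve")
parser.add_argument("--port", type=int)

try:
    parser.parse_args(["--port", "8888", "--stream-interval", "20"])
    status = 0
except SystemExit as exc:
    # argparse prints "error: unrecognized arguments: ..." and calls sys.exit(2)
    status = exc.code

print(f"parser exited with status {status}")
```

The same SystemExit fires inside the server process, which is why the launch dies before binding the port.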

The specific code path

The relevant section of the script is:

vllm serve $MODEL --port $PORT \
    --tensor-parallel-size=$TP \
    $EP \
    --gpu-memory-utilization 0.95 \
    --max-model-len $MAX_MODEL_LEN \
    --block-size=32 \
    --kv-cache-dtype fp8 \
    --stream-interval 20 --no-enable-prefix-caching \   <-- line 50: invalid for vLLM
    --trust-remote-code > $SERVER_LOG 2>&1 &

After vllm serve exits, wait_for_server_ready will poll until the timeout expires (or detect the dead PID), and the benchmark job fails without producing any results.
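The failure path just described can be sketched as follows. This is a hypothetical reconstruction of what wait_for_server_ready does; the real function in benchmark_lib.sh may differ in details (flag names, timeout handling, health endpoint):

```shell
# Hypothetical sketch of the wait_for_server_ready failure path:
# poll a health endpoint, but bail out early if the server PID is gone.
wait_for_server_ready_sketch() {
    local port=$1 server_pid=$2 timeout=$3
    local deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # Dead-PID check: a server that crashed on startup never becomes healthy.
        if ! kill -0 "$server_pid" 2>/dev/null; then
            echo "server process $server_pid died before becoming healthy"
            return 1
        fi
        # Health poll against the OpenAI-compatible server.
        if curl -sf "http://localhost:${port}/health" >/dev/null 2>&1; then
            echo "server ready"
            return 0
        fi
        sleep 1
    done
    echo "timed out waiting for server on port $port"
    return 1
}

# Simulate a server process that exits immediately (as with the bad CLI flag).
true & dead_pid=$!
wait "$dead_pid" 2>/dev/null
wait_for_server_ready_sketch 8888 "$dead_pid" 3 || echo "benchmark job aborts here"
```

With the unrecognized flag, the server PID is dead on the first poll, so the dead-PID branch fires immediately rather than waiting out the full timeout.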

Why existing code does not prevent it

No argument validation is performed in benchmark_lib.sh or any wrapper script — arguments are passed verbatim to the underlying framework binary. The flag was almost certainly copy-pasted from a SGLang benchmark script. Every other occurrence of --stream-interval in the repo (dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_nvfp4_b200.sh, qwen3.5_bf16_b200.sh, etc.) is in a script that launches python3 -m sglang.launch_server, not vllm serve. A targeted search for 'vllm serve.*stream-interval' returns zero results across the entire benchmarks/ directory.

Step-by-step proof

  1. A CI job picks up minimaxm2.5-fp4-b200-vllm from nvidia-master.yaml.
  2. The harness executes minimaxm2.5_fp4_b200.sh inside the vLLM nightly container.
  3. The shell reaches the vllm serve block and starts the server process.
  4. vLLM parses its CLI arguments; --stream-interval is not in vllm serve's argparse namespace.
  5. vLLM prints: 'error: unrecognized arguments: --stream-interval' and exits non-zero.
  6. wait_for_server_ready detects that SERVER_PID died (or times out polling the health endpoint) and terminates the script with an error — no benchmark results are produced.

How to fix it

Simply remove --stream-interval 20 from the vllm serve invocation on line 50. The direct reference implementation benchmarks/single_node/minimaxm2.5_fp8_b200.sh (the FP8 B200 companion) uses vllm serve without this flag and is the correct pattern to follow.

    --trust-remote-code > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
7 changes: 7 additions & 0 deletions perf-changelog.yaml
@@ -1244,4 +1244,11 @@
    - "Remove ISL 1024 / OSL 8192 seq-len config"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/947

- config-keys:
    - minimaxm2.5-fp4-b200-vllm
  description:
    - "Add MiniMax-M2.5 NVFP4 vLLM benchmark config for B200"
    - "Uses nvidia/MiniMax-M2.5-NVFP4 model checkpoint"
    - "Image: vllm/vllm-openai:v0.19.0-cu130"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/996
