[WIP] Minimaxm2.5 nvfp4 b200 #996
base: main
Changes from all commits
67fdc4d
08bcde8
1ac9636
3b130da
New file `benchmarks/single_node/minimaxm2.5_fp4_b200.sh` (`@@ -0,0 +1,79 @@`):

```shell
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    EP_SIZE \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export VLLM_FLASHINFER_ALLREDUCE_BACKEND=mnnvl

if [ "$EP_SIZE" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
vllm serve $MODEL --port $PORT \
    --tensor-parallel-size=$TP \
    $EP \
    --gpu-memory-utilization 0.95 \
    --max-model-len $MAX_MODEL_LEN \
    --block-size=32 \
    --kv-cache-dtype fp8 \
    --stream-interval 20 --no-enable-prefix-caching \
```
**Contributor**

🔴 The script passes `--stream-interval 20` to `vllm serve`, but this is an SGLang-specific argument that vLLM does not recognize; the server exits immediately with an `unrecognized arguments: --stream-interval` error, preventing the benchmark from running at all. Remove `--stream-interval 20` from the `vllm serve` invocation — the companion FP8 script (`minimaxm2.5_fp8_b200.sh`) omits this flag and serves as the correct reference.

Extended reasoning:

**What the bug is and how it manifests.** Line 50 of `benchmarks/single_node/minimaxm2.5_fp4_b200.sh` passes `--stream-interval 20` as part of the `vllm serve` command. This flag is an SGLang server parameter (it controls how often SGLang flushes streamed tokens) and is not part of the vLLM CLI argument set. When vLLM receives an unrecognized argument, argparse's standard error path prints `unrecognized arguments: --stream-interval` and the process exits with a non-zero status before the server ever starts. After `vllm serve` exits, `wait_for_server_ready` will poll until the timeout expires (or detect the dead PID), and the benchmark job fails without producing any results.

**Why existing code does not prevent it.** No argument validation is performed in `benchmark_lib.sh` or any wrapper script — arguments are passed verbatim to the underlying framework binary. The flag was almost certainly copy-pasted from an SGLang benchmark script: every other occurrence of `--stream-interval` in the repo (`dsr1_fp4_b200.sh`, `dsr1_fp8_b200.sh`, `glm5_nvfp4_b200.sh`, `qwen3.5_bf16_b200.sh`, etc.) is in a script that launches `python3 -m sglang.launch_server`, not `vllm serve`. A targeted search for `vllm serve.*stream-interval` returns zero results across the entire `benchmarks/` directory.

**How to fix it.** Remove `--stream-interval 20` from the `vllm serve` invocation on line 50. The FP8 B200 companion, `benchmarks/single_node/minimaxm2.5_fp8_b200.sh`, uses `vllm serve` without this flag and is the correct pattern to follow.
The diff continues:

```shell
    --trust-remote-code > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
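The script's first step delegates to `check_env_vars` from `benchmark_lib.sh`, which is not part of this diff. A minimal sketch of what such a helper likely does — the implementation below is an assumption, not the library's actual code:

```shell
# Hypothetical sketch of benchmark_lib.sh's check_env_vars helper (the
# library is not shown in this diff): fail if any named variable is unset.
check_env_vars() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name
    if [ -z "${!name:-}" ]; then
      echo "ERROR: required environment variable $name is unset" >&2
      missing=1
    fi
  done
  return "$missing"
}

MODEL=example-org/example-model TP=8
check_env_vars MODEL TP && echo "all required variables set"
```

Checking all nine required variables up front lets the job die in seconds with a clear message, rather than partway through a long server start.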
add nvfp4 minimax to vllm recipes plz
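The reviewer's point generalizes: because the wrappers pass flags verbatim to the framework binary, a framework mismatch is only caught when the server dies and `wait_for_server_ready` times out. A hypothetical pre-flight guard could fail fast instead — all names below are illustrative and not from this repo; `serve_known_flags` stands in for parsing the real CLI's `--help` output:

```shell
# Sketch of a pre-flight flag check (hypothetical, not part of the repo):
# reject flags the serve CLI does not know before launching the server.
serve_known_flags() {
  # Stand-in for `vllm serve --help` parsing; a real guard would derive this list.
  printf '%s\n' --tensor-parallel-size --kv-cache-dtype --max-model-len
}

check_flag() {
  if serve_known_flags | grep -qx -- "$1"; then
    echo "ok: $1"
  else
    echo "unknown flag: $1" >&2
    return 1
  fi
}

check_flag --kv-cache-dtype           # known flag: prints "ok: --kv-cache-dtype"
check_flag --stream-interval || true  # SGLang-only flag: reported as unknown
```

Such a guard would have flagged the copy-pasted `--stream-interval` at submit time rather than after a timed-out benchmark job.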