
[WIP] Minimaxm2.5 nvfp4 b200 #996

Open
Ankur-singh wants to merge 2 commits into main from minimaxm2.5-nvfp4-b200

Conversation

Collaborator

@Ankur-singh Ankur-singh commented Apr 2, 2026

Summary

Add MiniMax-M2.5 NVFP4 benchmark configuration and script for B200 GPUs using vLLM.

Changes

  • Benchmark config (.github/configs/nvidia-master.yaml): Added minimaxm2.5-fp4-b200-vllm config using nvidia/MiniMax-M2.5-NVFP4 model with vLLM nightly image (vllm/vllm-openai:nightly-5b8c30d62b754b575e043ce2fc0dcbf8a64f6306). Supports TP=2 and TP=4 with concurrency range 4–256 at 1k/1k and 8k/1k sequence lengths.
  • Benchmark script (benchmarks/single_node/minimaxm2.5_fp4_b200.sh): New vLLM serve script with FP8 KV cache, expert parallelism support, and VLLM_FLASHINFER_ALLREDUCE_BACKEND=mnnvl.
  • Perf changelog (perf-changelog.yaml): Added entry for the new config.
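
The changes above can be sketched as a config entry. Beyond the `image:` line and the `{ tp, conc-start, conc-end }` row shape visible in the review excerpts further down, every key name here is an assumption about the repo's schema, not its actual format:

```yaml
minimaxm2.5-fp4-b200-vllm:
  image: vllm/vllm-openai:nightly-5b8c30d62b754b575e043ce2fc0dcbf8a64f6306
  model: nvidia/MiniMax-M2.5-NVFP4            # key name assumed
  seq-lens:                                   # 1k/1k and 8k/1k sweeps
    - { isl: 1024, osl: 1024 }
    - { isl: 8192, osl: 1024 }
  parallel:                                   # TP=2 and TP=4
    - { tp: 2, conc-start: 4, conc-end: 256 }
    - { tp: 4, conc-start: 4, conc-end: 256 }
```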

Test Plan

  • Run e2e benchmarks for minimaxm2.5-fp4-b200-vllm on B200 to validate config

Contributor

github-actions bot commented Apr 2, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

@Ankur-singh
Collaborator Author

@claude update PR description and fix PR place holder in perf-changelog.yaml file

Collaborator

Klaud-Cold commented Apr 2, 2026

Claude finished @Ankur-singh's task in 1m 7s


Task: Update PR description and fix perf-changelog.yaml

  • Gather context from changed files
  • Update PR description
  • Fix PR placeholder in perf-changelog.yaml (pull/xxx → pull/996)
  • Commit and push changes

Done

  • Updated PR description with a summary of changes (config, benchmark script, perf changelog) and a test plan.
  • Fixed perf-changelog.yaml — replaced placeholder pr-link: .../pull/xxx with pr-link: .../pull/996 and pushed commit 31510da.

Klaud-Cold pushed a commit that referenced this pull request Apr 2, 2026
Co-authored-by: Ankur Singh <Ankur-singh@users.noreply.github.com>
--max-model-len $MAX_MODEL_LEN \
--block-size=32 \
--kv-cache-dtype fp8 \
--stream-interval 20 --no-enable-prefix-caching \
Contributor


🔴 The script passes --stream-interval 20 to vllm serve, which is a SGLang-specific argument not recognized by vLLM; this will cause the server to exit immediately with an 'unrecognized arguments: --stream-interval' error, preventing the benchmark from running at all. Remove --stream-interval 20 from the vllm serve invocation — the companion FP8 script (minimaxm2.5_fp8_b200.sh) omits this flag and serves as the correct reference.

Extended reasoning...

Bug: --stream-interval is a SGLang-only argument passed to vllm serve

What the bug is and how it manifests

Line 50 of benchmarks/single_node/minimaxm2.5_fp4_b200.sh passes --stream-interval 20 as part of the vllm serve command. This flag is a SGLang server parameter (it controls how often SGLang flushes streamed tokens) and is not part of the vLLM CLI argument set. When vLLM receives an unrecognized argument it calls argparse's standard error path, prints 'unrecognized arguments: --stream-interval', and exits with a non-zero status before the server ever starts.

The specific code path

The relevant section of the script is:

vllm serve $MODEL --port $PORT \
    --tensor-parallel-size=$TP \
    $EP \
    --gpu-memory-utilization 0.95 \
    --max-model-len $MAX_MODEL_LEN \
    --block-size=32 \
    --kv-cache-dtype fp8 \
    --stream-interval 20 --no-enable-prefix-caching \    <-- line 50: invalid for vLLM
    --trust-remote-code > $SERVER_LOG 2>&1 &

After vllm serve exits, wait_for_server_ready will poll until the timeout expires (or detect the dead PID), and the benchmark job fails without producing any results.
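
The `wait_for_server_ready` helper itself is not reproduced in this thread; a minimal sketch of such a readiness poll, assuming a `/health` endpoint and these argument conventions, could look like:

```shell
# Hypothetical readiness poll: succeed once the server answers /health,
# fail fast if the server process has already died, give up on timeout.
wait_for_server_ready() {
  local pid="$1" base_url="$2" timeout="${3:-300}"
  local start
  start=$(date +%s)
  while true; do
    # A dead server PID means vllm serve exited (e.g. on a bad CLI flag).
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "server process $pid exited before becoming ready" >&2
      return 1
    fi
    # Health probe; -f makes curl fail on HTTP errors.
    if curl -sf "${base_url}/health" >/dev/null 2>&1; then
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      echo "timed out waiting for ${base_url}" >&2
      return 1
    fi
    sleep 2
  done
}
```

With this shape, a server that dies on an unrecognized flag is detected on the first loop iteration rather than after the full timeout.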

Why existing code does not prevent it

No argument validation is performed in benchmark_lib.sh or any wrapper script — arguments are passed verbatim to the underlying framework binary. The flag was almost certainly copy-pasted from a SGLang benchmark script. Every other occurrence of --stream-interval in the repo (dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_nvfp4_b200.sh, qwen3.5_bf16_b200.sh, etc.) is in a script that launches python3 -m sglang.launch_server, not vllm serve. A targeted search for 'vllm serve.*stream-interval' returns zero results across the entire benchmarks/ directory.
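
The audit described above can be demonstrated on synthetic files (the real benchmarks/ layout and filenames here are assumptions of this sketch, not the repo's contents):

```shell
# Flag only scripts that BOTH invoke `vllm serve` AND pass --stream-interval;
# SGLang scripts using the flag legitimately are not reported.
dir=$(mktemp -d)
printf 'python3 -m sglang.launch_server --stream-interval 20\n' > "$dir/sglang_style.sh"
printf 'vllm serve some-model --stream-interval 20\n' > "$dir/vllm_style.sh"
grep -l 'vllm serve' "$dir"/*.sh | xargs grep -l 'stream-interval'
```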

Step-by-step proof

  1. A CI job picks up minimaxm2.5-fp4-b200-vllm from nvidia-master.yaml.
  2. The harness executes minimaxm2.5_fp4_b200.sh inside the vLLM nightly container.
  3. The shell reaches the vllm serve block and starts the server process.
  4. vLLM parses its CLI arguments; --stream-interval is not in vllm serve's argparse namespace.
  5. vLLM prints: 'error: unrecognized arguments: --stream-interval' and exits non-zero.
  6. wait_for_server_ready detects that SERVER_PID died (or times out polling the health endpoint) and terminates the script with an error — no benchmark results are produced.

How to fix it

Simply remove --stream-interval 20 from the vllm serve invocation on line 50. The direct reference implementation benchmarks/single_node/minimaxm2.5_fp8_b200.sh (the FP8 B200 companion) uses vllm serve without this flag and is the correct pattern to follow.
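
The removal is a one-line edit; a sketch of it, demonstrated on a stand-in file since minimaxm2.5_fp4_b200.sh itself is not reproduced in this thread:

```shell
# Strip the invalid flag in place with GNU sed, leaving the rest of the
# invocation (including --no-enable-prefix-caching) untouched.
f=$(mktemp)
printf '%s\n' '  --kv-cache-dtype fp8 \' \
              '  --stream-interval 20 --no-enable-prefix-caching \' \
              '  --trust-remote-code' > "$f"
sed -i 's/--stream-interval 20 //' "$f"
grep -c 'stream-interval' "$f" || true   # prints 0: no matches remain
```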

Comment on lines +1241 to +1245
description:
- "MiniMax-M2.5 NVFP4 benchmark config for B200"
- "Uses nvidia/MiniMax-M2.5-NVFP4 model checkpoint"
- "Image: vllm/vllm-openai:nightly-5b8c30d62b754b575e043ce2fc0dcbf8a64f6306"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/996
Contributor


🟡 The new perf-changelog.yaml entry for minimaxm2.5-fp4-b200-vllm has pr-link: .../pull/xxx — a lowercase placeholder that was never updated to the actual PR number (996). After merge, the changelog link will point to a non-existent URL.

Extended reasoning...

What the bug is: The changelog entry added by this PR ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/xxx. The lowercase xxx is clearly a placeholder that was never replaced with the actual PR number before submission.

The specific code path: In perf-changelog.yaml at lines 1241-1245, the newly added entry for minimaxm2.5-fp4-b200-vllm reads:

- config-keys:
    - minimaxm2.5-fp4-b200-vllm
  description:
    - "MiniMax-M2.5 NVFP4 benchmark config for B200"
    - "Uses nvidia/MiniMax-M2.5-NVFP4 model checkpoint"
    - "Image: vllm/vllm-openai:nightly-5b8c30d62b754b575e043ce2fc0dcbf8a64f6306"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/xxx

Why existing code doesn't prevent it: The changelog file is manually maintained YAML with no automated validation of pr-link values. There's no CI check that verifies placeholder values aren't committed.

Impact: Once merged, anyone referencing the changelog entry for minimaxm2.5-fp4-b200-vllm will follow a broken link. The correct URL should be https://github.com/SemiAnalysisAI/InferenceX/pull/996. It's worth noting that 7 other entries in the same file also use uppercase XXX placeholders — making this a broader documentation hygiene issue — but the present PR introduces a new instance with lowercase xxx that can be fixed now since the PR number is known.

How to fix: Replace pull/xxx with pull/996 in the new entry.

Step-by-step proof:

  1. This PR is numbered 996 (visible in PR metadata: <pr number="996">).
  2. The diff adds a new entry to perf-changelog.yaml with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/xxx.
  3. Navigating to https://github.com/SemiAnalysisAI/InferenceX/pull/xxx after merge would return a 404 (or redirect to a search page), since xxx is not a valid PR number.
  4. The fix is trivial: change xxx → 996 in that one line.
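
Since the file has no automated validation, a pre-merge guard could catch future placeholders. A hypothetical sketch, demonstrated on an inline sample because the full perf-changelog.yaml is not reproduced here:

```shell
# Fail if any changelog entry still carries an xxx/XXX placeholder pr-link.
f=$(mktemp)
cat > "$f" <<'EOF'
- config-keys:
    - minimaxm2.5-fp4-b200-vllm
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/xxx
EOF
if grep -Eiq 'pr-link:.*/pull/xxx' "$f"; then
  echo "placeholder pr-link found; replace xxx with the real PR number" >&2
fi
```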

start_gpu_monitor

set -x
vllm serve $MODEL --port $PORT \
Contributor


add nvfp4 minimax to vllm recipes plz

- { tp: 4, conc-start: 4, conc-end: 64 }

minimaxm2.5-fp4-b200-vllm:
image: vllm/vllm-openai:nightly-5b8c30d62b754b575e043ce2fc0dcbf8a64f6306
Contributor


0.19 comes out tmw btw

@Ankur-singh Ankur-singh force-pushed the minimaxm2.5-nvfp4-b200 branch from 31510da to 08bcde8 Compare April 2, 2026 21:50