Conversation
The sglang 0.5.8 Docker image ships a newer lm-eval 0.4.9.2 commit that defaults fewshot_as_multiturn=True for chat-completion models. Since its version string matches the pinned commit's, pip silently skips the pinned install. Adding --force-reinstall ensures the pinned commit is always used, regardless of what is pre-installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
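As a sketch of the fix, the install has to bypass pip's version check, since pip compares version strings rather than commits. The repository URL and the `<pinned-commit>` placeholder below are illustrative, not the actual pin:

```python
# Hypothetical sketch: building the pinned-commit install command.
# pip sees the pre-installed lm-eval 0.4.9.2 as satisfying the requirement,
# so a VCS pin with the same version string is silently skipped unless
# --force-reinstall is passed.
import sys

def pinned_install_cmd(pin):
    return [sys.executable, "-m", "pip", "install",
            "--force-reinstall", "--no-deps", pin]

cmd = pinned_install_cmd(
    "git+https://github.com/EleutherAI/lm-evaluation-harness@<pinned-commit>")
```

`--no-deps` keeps the reinstall from churning the rest of the image's environment; only the pinned package itself is replaced.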
Adds dsr1-fp8-mi355x-sglang-disagg-nodpa-eval: same image/model/precision as the DPA config but with dp-attn=false and ep=1. Running evals on this will tell us if DPA is the cause of the 0% GSM8K score or if it's something else about the fp8 disagg setup. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
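A hypothetical sketch of what the new variant could look like; every key name here is a guess, since the actual config schema isn't shown in this conversation:

```yaml
# Hypothetical shape only; key names are illustrative, not the repo's schema.
dsr1-fp8-mi355x-sglang-disagg-nodpa-eval:
  # same image/model/precision as the DPA config, with DPA disabled,
  # to isolate whether dp-attention causes the 0% GSM8K score:
  dp-attn: false
  ep: 1
```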
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
@Oseltamivir can you upstream your changes to
This is a large infrastructure PR (22 files) touching multi-node CI/CD workflows, all NVIDIA runner scripts, AMD server logic, and Python result collection — including a switch from the ishandhanani/srt-slurm fork to Oseltamivir/srt-slurm across all NVIDIA runners, which warrants a human look.
Overview
The PR adds eval-only support for multi-node benchmarks, touching GitHub Actions workflows (benchmark-multinode-tmpl.yml, e2e-tests.yml, run-sweep.yml), all six NVIDIA Slurm runner scripts, the AMD MI355X server.sh/job.slurm/submit.sh, shared benchmark_lib.sh, and Python utilities for config generation and result collection.
Security risks
The most notable concern is the switch from ishandhanani/srt-slurm to Oseltamivir/srt-slurm@sa-submission-q1-2026 across all NVIDIA multi-node runners. This changes the external code being cloned and executed on production cluster runners. While the PR description enumerates the fork's delta vs upstream, a human should validate the trust decision of pinning to this fork at this branch.
Level of scrutiny
High scrutiny is warranted. This PR touches production CI/CD infrastructure across multiple hardware platforms, introduces a new external dependency fork, and the PR description itself documents known partial failures (H200 dynamo-sglang jobs failing, MI355X DPA=true rows failing at 0.0). These open issues suggest the eval path is not fully stable across all targets yet.
Other factors
The no-bugs finding from the automated system is reassuring for logic correctness, but the scope (22 files, new workflow job types, eval artifact pipeline, split summary tables) and the documented known failures make this a PR that should have at least one human reviewer before merge.
Summary
Add eval-only support for multi-node benchmarks and wire those eval results into CI collection + summary reporting.
This covers:
- `server.sh`
- the `srt-slurm` fork

How evals are run
Single-node evals are selected on `8k1k` at max + median concurrency for each `(model, runner, framework, precision, spec-decoding, dp-attn)` group.

Multi-node evals are selected on `8k1k` by taking the entry with the highest max concurrency for each `(model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn)` group, then running eval at the median concurrency from that config via `eval-conc`.

`EVAL_ONLY=true` starts the server with expanded eval context, skips throughput benchmarking, runs `lm-eval`, writes `meta_env.json` + `results*.json` + `sample*.jsonl`, uploads those artifacts, then validates scores against thresholds.
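The selection rule above can be sketched in Python. This is an illustrative model with assumed data shapes, not the repo's actual collection code:

```python
# Sketch of multi-node eval selection: per group key, keep the sweep entry
# with the highest max concurrency, then eval at its median concurrency.
from statistics import median

sweep = [
    # hypothetical sweep entries; field names mirror the group key above
    {"model": "dsr1", "runner": "mi355x", "framework": "sglang",
     "precision": "fp8", "spec_decoding": False,
     "prefill_dp_attn": True, "decode_dp_attn": True,
     "concurrencies": [8, 16, 32, 64]},
    {"model": "dsr1", "runner": "mi355x", "framework": "sglang",
     "precision": "fp8", "spec_decoding": False,
     "prefill_dp_attn": True, "decode_dp_attn": True,
     "concurrencies": [8, 16, 32, 64, 128]},
]

GROUP_FIELDS = ("model", "runner", "framework", "precision",
                "spec_decoding", "prefill_dp_attn", "decode_dp_attn")

best = {}
for entry in sweep:
    key = tuple(entry[f] for f in GROUP_FIELDS)
    if key not in best or max(entry["concurrencies"]) > max(best[key]["concurrencies"]):
        best[key] = entry

# one EVAL_CONC per group: the median concurrency of the selected config
eval_conc = {key: int(median(e["concurrencies"])) for key, e in best.items()}
```

With the two entries above, the second (max concurrency 128) wins its group, and eval runs at its median concurrency of 32.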
srt-slurm fork delta vs upstream
NVIDIA multinode eval uses
Oseltamivir/srt-slurm@sa-submission-q1-2026instead ofishandhanani/srt-slurm.Compared with current
upstream/main, that fork adds the eval path InferenceX needs:lm-evalbenchmark runner/infmax-workspacemounting viaINFMAX_WORKSPACEEVAL_ONLYsupport indo_sweep.pyto skip benchmark stage and run post-eval directlywait_for_model()health checking before eval in eval-only modeMODEL_NAME=self.config.served_model_nameso eval queries the served alias, not the HF repo idEVAL_CONCfrom workflow toEVAL_CONCURRENT_REQUESTS/logs/eval_results/for launcher-side artifact pickupValidation
- 23888824506
- 23802423939
- 23882945894: `dynamo-sglang` FP8 MTP job hit server health timeout
- 23909140268: `dynamo-trt` passed, `dynamo-sglang` jobs failed before Slurm log creation, issue raised
- 23800447228
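The eval-only gating described above (`wait_for_model()` health checking before `lm-eval`, then score validation against thresholds) can be sketched as follows. This is not the fork's actual code; the endpoint path, timeouts, task names, and threshold floors are all assumptions:

```python
# Hypothetical sketch of the eval-only flow: poll server health, then gate
# the job on per-task score floors.
import time
import urllib.request

def wait_for_model(base_url, timeout_s=600, poll_s=5):
    """Return True once GET {base_url}/health answers 200, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server not up yet (refused / timed out); keep polling
        time.sleep(poll_s)
    return False

def validate_scores(results, thresholds):
    """Return the subset of scores below their floor; empty dict means pass."""
    return {task: score for task, score in results.items()
            if task in thresholds and score < thresholds[task]}

# A 0.0 GSM8K score like the failing DPA=true rows would trip this gate
# (the 0.9 floor is an illustrative value, not the repo's threshold):
failures = validate_scores({"gsm8k": 0.0}, {"gsm8k": 0.9})
```

Gating on health first keeps a dead server from surfacing as a misleading 0.0 eval score; the two failure modes stay distinguishable in the summary.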