[AMD][MI30X] Update Qwen3.5 perf #986
base: main
The first of the four updated benchmark scripts gains an explicit prefill budget and a context length for the non-eval path:

```diff
@@ -19,11 +19,14 @@ hf download "$MODEL"
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
 CONTEXT_LENGTH=$((ISL + OSL + 20))
+MAX_PREFILL_TOKENS=32768

 EVAL_CONTEXT_ARGS=""
 if [ "${EVAL_ONLY}" = "true" ]; then
 setup_eval_context
 EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
+else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
 fi
 # Start GPU monitoring (power, temperature, clocks every second)
 start_gpu_monitor
```

**Collaborator** (on `MAX_PREFILL_TOKENS=32768`): Why not a bigger prefill size?
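To make the context-length arithmetic above concrete, here is a standalone sketch; the `ISL`/`OSL` values are made up for illustration and are not the benchmark's real settings:

```shell
# Illustrative values only; the benchmark reads ISL/OSL from its environment.
ISL=1024
OSL=1024

# Context length leaves a small 20-token margin beyond input + output.
CONTEXT_LENGTH=$((ISL + OSL + 20))

# EVAL_ONLY selects between the eval context and the benchmark context.
EVAL_ONLY=false
if [ "${EVAL_ONLY}" = "true" ]; then
  EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else
  EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi

echo "$EVAL_CONTEXT_ARGS"   # --context-length 2068
```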
The server launch command picks up the new tuning flags:

```diff
@@ -35,9 +38,15 @@ python3 -m sglang.launch_server \
 --host=0.0.0.0 \
 --port $PORT \
 --tensor-parallel-size $TP \
+--data-parallel-size 1 \
 --trust-remote-code \
---mem-fraction-static 0.8 \
---disable-radix-cache $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
+--tokenizer-worker-num 6 \
+--enable-aiter-allreduce-fusion \
+--cuda-graph-max-bs $CONC \
+--disable-radix-cache \
+--max-prefill-tokens $MAX_PREFILL_TOKENS \
+--scheduler-recv-interval 30 \
+--mem-fraction-static 0.75 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

 SERVER_PID=$!
```

**Contributor** (on lines 38 to 40): 🔴 The four MI300X/MI325X Qwen3.5 benchmark scripts […] rely on `EP_SIZE` without it being validated by `check_env_vars`.

- *What the bug is and how it manifests:* […]
- *The specific code path that triggers it:* […]
- *Why existing code does not prevent it:* […]
- *Impact:* […]
- *How to fix it:* add `EP_SIZE` to the `check_env_vars` list:

```sh
check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE  # <-- add this line
```

  Alternatively, also add an […] default.
- *Step-by-step proof:* […]

**Collaborator** (on `--data-parallel-size 1`): Is the default value 1?

**Collaborator** (on `--tokenizer-worker-num 6`): Why not 8?

**Collaborator** (on `--enable-aiter-allreduce-fusion`): What is this for?

**Collaborator** (on `--mem-fraction-static 0.75`): Why not 0.85?
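The `check_env_vars` helper the reviewer refers to is not part of this diff. As a rough illustration only, a fail-fast validator of that shape could look like the sketch below; the real helper in the repo may differ:

```shell
# Hypothetical sketch of a fail-fast env validator, NOT the repo's actual
# implementation. Uses bash indirect expansion ${!name} to look up each
# variable by name and reports every one that is unset or empty.
check_env_vars() {
  local missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "ERROR: required env var $name is not set" >&2
      missing=1
    fi
  done
  return $missing
}

# Example: TP is set, EP_SIZE is not, so validation fails.
TP=8
check_env_vars TP EP_SIZE || echo "validation failed"
```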
A second benchmark script receives the same change:

```diff
@@ -19,11 +19,14 @@ hf download "$MODEL"
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
 CONTEXT_LENGTH=$((ISL + OSL + 20))
+MAX_PREFILL_TOKENS=32768

 EVAL_CONTEXT_ARGS=""
 if [ "${EVAL_ONLY}" = "true" ]; then
 setup_eval_context
 EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
+else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
 fi
 # Start GPU monitoring (power, temperature, clocks every second)
 start_gpu_monitor
@@ -35,9 +38,15 @@ python3 -m sglang.launch_server \
 --host=0.0.0.0 \
 --port $PORT \
 --tensor-parallel-size $TP \
+--data-parallel-size 1 \
 --trust-remote-code \
---mem-fraction-static 0.8 \
---disable-radix-cache $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
+--tokenizer-worker-num 6 \
+--enable-aiter-allreduce-fusion \
+--cuda-graph-max-bs $CONC \
+--disable-radix-cache \
+--max-prefill-tokens $MAX_PREFILL_TOKENS \
+--scheduler-recv-interval 30 \
+--mem-fraction-static 0.75 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

 SERVER_PID=$!
```

**Contributor** (on lines +43 to +44, the new `--tokenizer-worker-num` and `--enable-aiter-allreduce-fusion` flags): create sglang cookbook PR for adding these flags into the recipe doc please?
The changelog gains an entry for this PR:

```diff
@@ -1220,3 +1220,13 @@
   - "Uses nvidia/GLM-5-NVFP4 model with modelopt_fp4 quantization"
   - "Image: lmsysorg/sglang:nightly-dev-cu13-20260328-a27651d5"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/973
+- config-keys:
+    - qwen3.5-bf16-mi300x-sglang
+    - qwen3.5-bf16-mi325x-sglang
+    - qwen3.5-fp8-mi300x-sglang
+    - qwen3.5-fp8-mi325x-sglang
+  description:
+    - "Update cli args of Qwen3.5 FP8 and BF16 SGLang benchmarks for MI300X and MI325X to achieve better performance"
+    - "Use lmsysorg/sglang:v0.5.10rc0-rocm720-mi30x"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/986
```

**Contributor** (on lines +1229 to +1232): 🟡 The new changelog entry […]. The entry added by this PR ends with […]. The refutation argues that using a placeholder is an "established and accepted pattern" because other entries in the file also use […]; more importantly, those other entries used […]. Step-by-step proof: […]. The fix is trivial: change […].
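On the `pr-link` concern raised above: a placeholder link can be caught mechanically before merge. The helper below is a hypothetical sketch (its name and the URL pattern are illustrative, not from the repo's tooling):

```shell
# Hypothetical pre-merge check: a pr-link must be a concrete GitHub PR URL
# ending in a numeric PR id, so placeholders like .../pull/XXX are rejected.
is_valid_pr_link() {
  printf '%s' "$1" |
    grep -Eq '^https://github\.com/[A-Za-z0-9._-]+/[A-Za-z0-9._-]+/pull/[0-9]+$'
}

is_valid_pr_link "https://github.com/SemiAnalysisAI/InferenceX/pull/986" && echo valid
is_valid_pr_link "https://github.com/SemiAnalysisAI/InferenceX/pull/XXX" || echo invalid
```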
**Reviewer** (on `CONTEXT_LENGTH=$((ISL + OSL + 20))`): Why not ISL+OSL+200?