
[Feature] Support mtp overlap schedule#7001

Open
Sunny-bot1 wants to merge 11 commits into PaddlePaddle:develop from Sunny-bot1:mtp_merge

Conversation


@Sunny-bot1 Sunny-bot1 commented Mar 24, 2026

Motivation

Support enabling overlap schedule for MTP (with logprob disabled).

Modifications

Core optimizations

  1. Avoid synchronization overhead by precomputing values or reusing values from earlier steps
  2. Convert synchronous copies to asynchronous ones, deferring the data transfer off the critical path
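As an illustrative sketch only (not the actual FastDeploy code), the second optimization amounts to caching the previous step's host copy so the hot path never blocks on a device-to-host synchronization; the `DeferredCopy` class here is a hypothetical stand-in for the runner's event-based copies:

```python
import numpy as np

class DeferredCopy:
    """Hypothetical sketch: cache the previous step's host copy so the
    hot path never has to wait on a device->host synchronization."""

    def __init__(self, shape):
        # Value from the *previous* step, reused on the critical path.
        self.host_cache = np.zeros(shape, dtype=np.int32)
        self._pending = None

    def schedule(self, device_tensor):
        # In the real runner this would be a non-blocking async copy;
        # here we just record the source and complete it later.
        self._pending = device_tensor

    def resolve(self):
        # Called off the critical path (e.g. at the start of the next step).
        if self._pending is not None:
            self.host_cache = np.asarray(self._pending).copy()
            self._pending = None
        return self.host_cache

seq_lens = DeferredCopy((4,))
seq_lens.schedule([3, 0, 5, 1])              # enqueue the copy, do not wait
real_bsz = int((seq_lens.host_cache > 0).sum())  # previous-step value: 0
seq_lens.resolve()                           # off the critical path
next_bsz = int((seq_lens.host_cache > 0).sum())  # now 3
```

The trade-off is that the critical path sees values that are one step stale, which is why the PR restricts the reuse to quantities (such as `real_bsz`) that are valid across adjacent decode steps.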

| File | Phase | Operator | Blocking operation | Fix |
|---|---|---|---|---|
| gpu_model_runner.py | pre-process | unified_update_model_status | not_need_stop | removed |
| gpu_model_runner.py | post-process | save_output | accept_tokens_cpu, accept_num_cpu, seq_lens_decoder_cpu, prompt_lens_cpu | async copy with deferred transfer, unified with the non-MTP code path |
| mtp.py | pre-process | draft_model_preprocess | not_need_stop | removed |
| mtp.py | pre-process | eagle_get_hidden_states (output_token_num) | self._mtp_input_token_num_event.synchronize() | use token_num_cpu directly |
| mtp.py | pre-process | eagle_get_self_hidden_states (multi-step) | output_token_num_cpu | use token_num_cpu directly |
| mtp.py | pre-process | _propose_cuda (token_num_cpu) | self.model_inputs["seq_lens_this_time"].numpy().sum().item() | use the current round's real_bsz passed from the main model: (self.share_inputs["seq_lens_this_time_cpu"].numpy() > 0).sum() |
| mtp.py | pre-process | exist_prefill() | np.any(self.share_inputs["seq_lens_encoder"].numpy() > 0) | self.exist_prefill_flag |
| mtp.py | post-process | pre_process (real_output_token_num) | self._draft_output_token_num_event.synchronize() | use token_num_cpu directly |
| mtp.py | post-process | draft_model_update | not_need_stop | removed |
| gpu_model_runner.py | post-process | speculate_schedule_cache | not_need_stop | removed |
| gpu_model_runner.py | post-process | pre_process (real_output_token_num) | self.output_token_num_event.synchronize() | use (self.share_inputs["seq_lens_this_time_cpu"].numpy() > 0).sum() directly |
| worker_process.py | — | tp_barrier | all_reduce | automatically switch to a CPU barrier when overlap schedule is enabled |
| cudagraph_piecewise_backend.py | — | call (num_running_requests) | num_running_requests = int((seq_lens_this_time.flatten() > 0).sum().item()) | main model reuses the previous round's real_bsz ((seq_lens_this_time > 0).sum()); MTP uses the current round's real_bsz |
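The `real_bsz` value that several of the fixes above substitute for a device synchronization reduces to a CPU-side mask-and-sum. A minimal sketch with numpy, using stand-in data:

```python
import numpy as np

# CPU-side mirror of seq_lens_this_time; values are stand-ins for illustration.
seq_lens_this_time_cpu = np.array([8, 0, 3, 0, 1], dtype=np.int32)

# Number of active requests this round: count of non-empty sequences.
real_bsz = int((seq_lens_this_time_cpu > 0).sum())
print(real_bsz)  # 3
```

Because this operates on an already-resident host array, it costs no GPU synchronization, which is the whole point of passing `seq_lens_this_time_cpu` from the main model to the MTP path.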



Notes

  1. The optimizations above take effect only for decode batches; prefill/mixed phases keep the original logic
  2. Under overlap schedule, space pre-allocation introduces invalid slots, so kernels exit early when batch_id_per_token < 0
  3. When overlap is disabled, the original logic is preserved
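The early-exit on `batch_id_per_token < 0` lives inside CUDA kernels in the actual change; a Python sketch of the same guard (function name and data are illustrative only):

```python
import numpy as np

def gather_valid_tokens(batch_id_per_token, token_values):
    """Sketch of the invalid-slot guard: under overlap schedule some slots
    are pre-allocated but unused, marked with batch_id_per_token < 0."""
    out = []
    for batch_id, value in zip(batch_id_per_token, token_values):
        if batch_id < 0:
            continue  # invalid slot: the real kernel thread returns early here
        out.append((int(batch_id), int(value)))
    return out

batch_ids = np.array([0, 0, -1, 1, -1], dtype=np.int32)
values = np.array([11, 12, 99, 21, 98], dtype=np.int32)
print(gather_valid_tokens(batch_ids, values))
# [(0, 11), (0, 12), (1, 21)]
```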

GLM TP4 results

(benchmark image omitted)

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Mar 24, 2026

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 75.75758% with 24 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@7a6c287). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/spec_decode/mtp.py 82.97% 6 Missing and 2 partials ⚠️
fastdeploy/worker/gpu_model_runner.py 71.42% 3 Missing and 3 partials ⚠️
fastdeploy/worker/input_batch.py 62.50% 6 Missing ⚠️
fastdeploy/model_executor/pre_and_post_process.py 55.55% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7001   +/-   ##
==========================================
  Coverage           ?   73.96%           
==========================================
  Files              ?      399           
  Lines              ?    56466           
  Branches           ?     8931           
==========================================
  Hits               ?    41765           
  Misses             ?    11720           
  Partials           ?     2981           
Flag Coverage Δ
GPU 73.96% <75.75%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.


@fastdeploy-bot fastdeploy-bot left a comment


AI CI Agent | skill: pr_review_agent

This PR implements MTP (Multi-Token Prediction) overlap scheduling support with significant refactoring of CUDA kernels to use cooperative groups for better parallelization. The changes include removing CPU-GPU copies for better performance, adding defensive checks for negative batch indices, and modifying API signatures.

I found 1 P1 logic bug that needs to be addressed before merging.

@@ -1167,6 +1196,17 @@ def _get_self_hidden_states(self, hidden_states):
)


P1 - API Mismatch Bug: The eagle_get_self_hidden_states kernel API was changed to expect seq_lens_encoder as the 4th parameter (see eagle_get_self_hidden_states.cu line 225), but this XPU branch is still passing step_idx. The kernel now uses seq_lens_encoder[t] > 0 to detect encoder phase instead of the old step_idx[i] == 1 check. This will produce incorrect results on XPU platform. Should be self.model_inputs.last_seq_lens_encoder (similar to the CUDA branch at line 1205) instead of self.model_inputs["step_idx"].

@@ -91,19 +91,20 @@ def test_eagle_get_self_hidden_states(self):


P1 - Test Inconsistency: The test passes step_idx_tensor but the kernel now expects seq_lens_encoder. While PaddlePaddle binds by position so this may not crash, the reference implementation computeOrderKernel (line 23-45) still uses step_idx == 1 logic while the actual CUDA kernel now uses seq_lens_encoder > 0 logic. The test should be updated to: 1) pass a proper seq_lens_encoder tensor, and 2) update the reference implementation to match the new kernel semantics.
