[Scheduler] Pre match radix tree in schedule#6989

Open
juncaipeng wants to merge 3 commits into PaddlePaddle:develop from juncaipeng:pre_match_tree

Conversation

@juncaipeng
Collaborator

Motivation

Pre-match the GPU cache in advance so that scheduling only needs enough blocks for the tokens that miss the cache, lowering the scheduling threshold for long multi-turn requests.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Add pre_match_block_on_gpu.
Adjust the check performed before calling get_prefix_cached_blocks.

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings March 24, 2026 08:47
@paddle-bot

paddle-bot bot commented Mar 24, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a read-only pre-check ("pre-matching against the GPU prefix radix tree") in the V1 scheduling flow: before calling get_prefix_cached_blocks(), the scheduler first estimates whether enough GPU blocks are available to cover the tokens that miss the cache, lowering the resource threshold that hierarchical-cache matching imposes on long multi-turn requests.

Changes:

  • Add PrefixCacheManager.pre_match_block_on_gpu(): a read-only traversal of the radix tree that counts prefix-hit tokens resident on GPU.
  • In ResourceManagerV1.schedule() and preallocate_resource_in_p(), use the pre-match result to compute need_block_num before the can_allocate_gpu_blocks() check.
  • Adjust the conditional in the CPU-cache preparation phase of request_match_blocks() to avoid a pointless can_allocate check on 0 blocks.
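The budget arithmetic behind the second bullet can be illustrated with a minimal sketch (not the PR's actual code): the pre-matched token count is subtracted from the prefill total before the ceil-division into blocks. The names `need_prefill_tokens`, `match_token_num`, and `block_size` mirror the PR; the numbers below are made up.

```python
# Hypothetical sketch of the block-budget calculation added in schedule():
# only tokens that miss the GPU prefix cache count toward the block budget.

def blocks_needed(need_prefill_tokens: int, match_token_num: int, block_size: int) -> int:
    """Ceil-divide the cache-missed tokens into blocks."""
    missed = need_prefill_tokens - match_token_num
    return (missed + block_size - 1) // block_size

# Without pre-matching, all 1000 tokens count toward the budget:
print(blocks_needed(1000, 0, 64))    # 16 blocks
# With 512 tokens already resident in GPU cache, only the miss counts:
print(blocks_needed(1000, 512, 64))  # 8 blocks
```

The resulting `need_block_num` (plus the blocks reserved for running requests) is what the scheduler then passes to the `can_allocate_gpu_blocks()` check.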

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
fastdeploy/engine/sched/resource_manager_v1.py Adds a block-budget check based on GPU prefix pre-matching before scheduling/preallocation, reducing the scheduling threshold and deadlock risk caused by hierarchical cache matching.
fastdeploy/cache_manager/prefix_cache_manager.py Adds the GPU-only pre-match method and adjusts the CPU cache allocation check.

@codecov-commenter

codecov-commenter commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 43.13725% with 29 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@5e469fc). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/cache_manager/prefix_cache_manager.py 12.50% 28 Missing ⚠️
fastdeploy/engine/sched/resource_manager_v1.py 94.44% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6989   +/-   ##
==========================================
  Coverage           ?   73.17%           
==========================================
  Files              ?      399           
  Lines              ?    56090           
  Branches           ?     8855           
==========================================
  Hits               ?    41044           
  Misses             ?    12141           
  Partials           ?     2905           
Flag Coverage Δ
GPU 73.17% <43.13%> (?)


Copilot AI review requested due to automatic review settings March 25, 2026 06:38
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

match_token_num = 0
mm_idx = 0
prefix_block_key = []
block_size = self.config.cache_config.block_size

Copilot AI Mar 25, 2026

pre_match_block_on_gpu reads self.config.cache_config.block_size directly, while the scheduler side already obtains block_size explicitly and uses it to compute need_block_num. To avoid future inconsistency between matching and estimation if the two block_size sources diverge, consider having this method accept a block_size parameter (consistent with the existing mm_match_block(..., block_size) / request_match_blocks(..., block_size)) and use the passed-in value internally.

Suggested change
-        block_size = self.config.cache_config.block_size
+        # Prefer block_size from request (set by scheduler) to avoid mismatch
+        # with need_block_num calculation; fall back to config for backward compatibility.
+        block_size = getattr(request, "block_size", None) or self.config.cache_config.block_size

Comment on lines +969 to +975

+            match_token_num, _ = self.cache_manager.pre_match_block_on_gpu(request)
+            need_prefill_tokens = request.need_prefill_tokens - match_token_num
+            need_block_num = (need_prefill_tokens + block_size - 1) // block_size
             if not self.cache_manager.can_allocate_gpu_blocks(
-                (request.need_prefill_tokens + self.config.cache_config.block_size - 1)
-                // self.config.cache_config.block_size
-            ):  # to prevent block allocation for matching in hierarchical cache and cause dead lock
+                need_block_num + running_req_reserved_block_num
+            ):
+                # to prevent block allocation for matching in hierarchical cache and cause dead lock

Copilot AI Mar 25, 2026

This introduces a new scheduling-threshold computation based on pre_match_block_on_gpu (match_token_num / need_block_num + running_req_reserved_block_num), which changes both the schedulability of WAITING requests and the break condition when hierarchical caching is enabled. The repository already has unit tests for ResourceManagerV1.schedule() (e.g. tests/v1/test_schedule_output.py); consider adding coverage for GPU prefix hit/miss, insufficient reserved blocks for running requests, and whether requests are scheduled or skipped as expected with hierarchical caching enabled, to guard against regressions.

Comment on lines +1752 to +1820
def pre_match_block_on_gpu(self, request):
    """
    Pre-match request tokens against cached GPU blocks in the radix tree.

    This method performs a prefix matching operation to find the longest sequence
    of tokens that already exist in GPU cache blocks. It traverses the radix tree
    from the root, computing hash values for each block-sized chunk of tokens
    and checking if corresponding nodes exist with GPU-resident data.

    Args:
        request: The inference request object containing prompt_token_ids and
            output_token_ids to be matched against the cache.

    Returns:
        tuple: A tuple containing:
            - match_token_num (int): The total number of tokens that were
              successfully matched in GPU-resident blocks.
            - last_node (BlockNode): The last matched node in the radix tree,
              which represents the deepest point of prefix cache hit.

    Note:
        - Only blocks with `has_in_gpu=True` are considered as valid matches.
        - The matching stops at the first mismatch or when a block is not in GPU.
        - This is a read-only operation that does not modify the radix tree
          or LRU data structures.
    """
    if isinstance(request.prompt_token_ids, np.ndarray):
        prompt_token_ids = request.prompt_token_ids.tolist()
    else:
        prompt_token_ids = request.prompt_token_ids
    input_ids = prompt_token_ids + request.output_token_ids
    total_token_num = len(input_ids)

    last_node = self.radix_tree_root
    match_token_num = 0
    mm_idx = 0
    prefix_block_key = []
    block_size = self.config.cache_config.block_size

    with self.cache_status_lock:
        while match_token_num < total_token_num:
            token_block = input_ids[match_token_num : match_token_num + block_size]
            if len(token_block) != block_size:
                break

            mm_idx, extra_keys = self.get_block_hash_extra_keys(
                request=request,
                start_idx=match_token_num,
                end_idx=match_token_num + block_size,
                mm_idx=mm_idx,
            )
            prefix_block_key.extend(extra_keys)
            hash_value = get_hash_str(token_block, prefix_block_key)
            prefix_block_key = [hash_value]

            if hash_value not in last_node.children:
                break

            child = last_node.children[hash_value]
            if not child.has_in_gpu:
                break
            match_token_num += block_size
            last_node = child

    logger.info(f"pre_match_block_on_gpu: req_id {request.request_id}, match_token_num {match_token_num}")
    return (
        match_token_num,
        last_node,
    )
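The traversal semantics of this method can be captured in a minimal, self-contained sketch. Here `Node`, `build_tree`, and the tuple-of-tokens "hash" are simplified stand-ins for the real radix tree (BlockNode, get_hash_str, multimodal extra keys), not the PR's actual implementation.

```python
# Illustrative sketch of the read-only GPU prefix match: walk block-sized
# chunks down a tree, stopping at the first miss or first non-GPU-resident node.
from dataclasses import dataclass, field

@dataclass
class Node:
    has_in_gpu: bool = True
    children: dict = field(default_factory=dict)

def build_tree(block_keys, gpu_flags):
    """Insert a chain of block keys; each flag marks GPU residency."""
    root = Node()
    node = root
    for key, on_gpu in zip(block_keys, gpu_flags):
        child = Node(has_in_gpu=on_gpu)
        node.children[key] = child
        node = child
    return root

def pre_match(root, token_ids, block_size):
    """Read-only walk: count tokens covered by GPU-resident prefix blocks."""
    matched = 0
    node = root
    while matched + block_size <= len(token_ids):
        key = tuple(token_ids[matched : matched + block_size])  # stand-in for get_hash_str
        child = node.children.get(key)
        if child is None or not child.has_in_gpu:
            break
        matched += block_size
        node = child
    return matched

tokens = list(range(12))
# First two blocks resident on GPU, third cached but only on CPU:
tree = build_tree(
    [tuple(tokens[0:4]), tuple(tokens[4:8]), tuple(tokens[8:12])],
    [True, True, False],
)
print(pre_match(tree, tokens, block_size=4))  # 8: stops at the non-GPU block
```

Note the same two stopping rules as the PR's method: a partial final chunk never matches, and a cached-but-evicted-to-CPU block ends the walk even though its key exists in the tree.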

Copilot AI Mar 25, 2026

pre_match_block_on_gpu is a newly introduced cache-matching entry point, but no test in the repository currently covers its behavior directly (e.g. only GPU-resident nodes match, matching stops at the first non-GPU node, and multimodal extra_keys participate in the hash consistently). Since this method affects scheduling decisions, consider adding unit tests in tests/cache_manager/test_prefix_cache_manager.py or tests/v1/cache_manager/test_prefix_cache.py to ensure its hash/traversal semantics stay consistent with the existing mm_match_block and to guard against drift in later refactors.
