[Scheduler] Pre match radix tree in schedule#6989
juncaipeng wants to merge 3 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
Pull request overview
This PR adds a read-only pre-match pass over the GPU radix tree to the V1 scheduling flow, so that before calling get_prefix_cached_blocks() the scheduler can check whether enough GPU blocks are available to cover the unmatched tokens. This lowers the resource threshold that hierarchical-cache matching imposes on long, multi-round requests.
Changes:
- Add `PrefixCacheManager.pre_match_block_on_gpu()`: a read-only traversal of the radix tree that counts prefix-hit tokens resident on GPU.
- In `ResourceManagerV1.schedule()` and `preallocate_resource_in_p()`, use the pre-match result to compute `need_block_num` before the `can_allocate_gpu_blocks()` check.
- Adjust the CPU-cache preparation branch in `request_match_blocks()` to avoid a pointless `can_allocate` check on 0 blocks.
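The budget check introduced here boils down to ceil-dividing the tokens that missed the GPU prefix cache by the block size. A minimal sketch of that arithmetic (the function name is hypothetical; the PR inlines this expression in `schedule()`):

```python
def need_block_num(need_prefill_tokens: int, match_token_num: int, block_size: int) -> int:
    """Blocks required to cover the tokens that missed the GPU prefix cache."""
    remaining = need_prefill_tokens - match_token_num
    # ceil division: round up to whole blocks
    return (remaining + block_size - 1) // block_size
```

With a 512-token prefix hit out of 1000 prefill tokens and 64-token blocks, only 8 blocks are needed instead of 16 for the full request.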
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/engine/sched/resource_manager_v1.py | Adds a block-budget check based on GPU prefix pre-matching before scheduling/preallocation, reducing the scheduling threshold and deadlock risk introduced by hierarchical cache matching. |
| fastdeploy/cache_manager/prefix_cache_manager.py | Adds the GPU-only pre-match method and adjusts the CPU cache allocation check. |
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff           @@
##           develop    #6989  +/-  ##
=========================================
  Coverage         ?   73.17%
=========================================
  Files            ?      399
  Lines            ?    56090
  Branches         ?     8855
=========================================
  Hits             ?    41044
  Misses           ?    12141
  Partials         ?     2905
```
```python
match_token_num = 0
mm_idx = 0
prefix_block_key = []
block_size = self.config.cache_config.block_size
```
pre_match_block_on_gpu reads self.config.cache_config.block_size directly, while the scheduler side already obtains block_size explicitly and uses it to compute need_block_num. To avoid future mismatch between matching and estimation if the two block_size sources diverge, consider having this method take a block_size parameter (consistent with the existing mm_match_block(..., block_size) / request_match_blocks(..., block_size)) and use the passed-in value internally.

Suggested change:

```diff
-        block_size = self.config.cache_config.block_size
+        # Prefer block_size from request (set by scheduler) to avoid mismatch
+        # with need_block_num calculation; fall back to config for backward compatibility.
+        block_size = getattr(request, "block_size", None) or self.config.cache_config.block_size
```
```diff
 match_token_num, _ = self.cache_manager.pre_match_block_on_gpu(request)
 need_prefill_tokens = request.need_prefill_tokens - match_token_num
 need_block_num = (need_prefill_tokens + block_size - 1) // block_size
 if not self.cache_manager.can_allocate_gpu_blocks(
-    (request.need_prefill_tokens + self.config.cache_config.block_size - 1)
-    // self.config.cache_config.block_size
-):  # to prevent block allocation for matching in hierarchical cache and cause dead lock
+    need_block_num + running_req_reserved_block_num
+):
+    # to prevent block allocation for matching in hierarchical cache and cause dead lock
```
This introduces a new scheduling-threshold computation based on pre_match_block_on_gpu (match_token_num / need_block_num + running_req_reserved_block_num), which changes both the schedulability of WAITING requests and the break condition when hierarchical cache is enabled. The repository already has unit tests for ResourceManagerV1.schedule() (e.g. tests/v1/test_schedule_output.py); consider adding coverage for GPU prefix hit/miss, insufficient running reserved blocks, and whether requests are scheduled or skipped as expected with hierarchical cache enabled, to avoid regressions.
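A toy model of the gating condition those tests would exercise (all names here are hypothetical stand-ins; the real check lives inline in ResourceManagerV1.schedule()):

```python
def can_schedule(need_prefill_tokens: int, match_token_num: int, block_size: int,
                 free_gpu_blocks: int, running_req_reserved_block_num: int) -> bool:
    """Toy gating check: blocks needed for cache-missed tokens,
    plus blocks reserved for already-running requests, must fit
    in the free GPU block budget."""
    need_block_num = (need_prefill_tokens - match_token_num + block_size - 1) // block_size
    return free_gpu_blocks >= need_block_num + running_req_reserved_block_num
```

A full prefix hit makes the request schedulable with only the reserved-block margin free, while a full miss demands the whole prefill budget up front.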
```python
def pre_match_block_on_gpu(self, request):
    """
    Pre-match request tokens against cached GPU blocks in the radix tree.

    This method performs a prefix matching operation to find the longest sequence
    of tokens that already exist in GPU cache blocks. It traverses the radix tree
    from the root, computing hash values for each block-sized chunk of tokens
    and checking if corresponding nodes exist with GPU-resident data.

    Args:
        request: The inference request object containing prompt_token_ids and
            output_token_ids to be matched against the cache.

    Returns:
        tuple: A tuple containing:
            - match_token_num (int): The total number of tokens that were
              successfully matched in GPU-resident blocks.
            - last_node (BlockNode): The last matched node in the radix tree,
              which represents the deepest point of prefix cache hit.

    Note:
        - Only blocks with `has_in_gpu=True` are considered as valid matches.
        - The matching stops at the first mismatch or when a block is not in GPU.
        - This is a read-only operation that does not modify the radix tree
          or LRU data structures.
    """
    if isinstance(request.prompt_token_ids, np.ndarray):
        prompt_token_ids = request.prompt_token_ids.tolist()
    else:
        prompt_token_ids = request.prompt_token_ids
    input_ids = prompt_token_ids + request.output_token_ids
    total_token_num = len(input_ids)

    last_node = self.radix_tree_root
    match_token_num = 0
    mm_idx = 0
    prefix_block_key = []
    block_size = self.config.cache_config.block_size

    with self.cache_status_lock:
        while match_token_num < total_token_num:
            token_block = input_ids[match_token_num : match_token_num + block_size]
            if len(token_block) != block_size:
                break

            mm_idx, extra_keys = self.get_block_hash_extra_keys(
                request=request,
                start_idx=match_token_num,
                end_idx=match_token_num + block_size,
                mm_idx=mm_idx,
            )
            prefix_block_key.extend(extra_keys)
            hash_value = get_hash_str(token_block, prefix_block_key)
            prefix_block_key = [hash_value]

            if hash_value not in last_node.children:
                break

            child = last_node.children[hash_value]
            if not child.has_in_gpu:
                break
            match_token_num += block_size
            last_node = child

    logger.info(f"pre_match_block_on_gpu: req_id {request.request_id}, match_token_num {match_token_num}")
    return (
        match_token_num,
        last_node,
    )
```
pre_match_block_on_gpu is a newly introduced cache-matching entry point, but its behavior is not directly covered by any existing test in the repository (for example: only GPU-resident nodes match, matching stops at the first non-GPU node, and multimodal extra_keys participate in the hash consistently). Since this method affects scheduling decisions, consider adding unit tests for it in tests/cache_manager/test_prefix_cache_manager.py or tests/v1/cache_manager/test_prefix_cache.py, to ensure its hash/traversal semantics stay consistent with the existing mm_match_block and to guard against drift in future refactors.
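A self-contained sketch of the traversal semantics such a test would pin down, using toy stand-ins for BlockNode and get_hash_str (all names here are hypothetical): matching walks block-sized chunks, chains each block's hash on the previous one, and stops at the first missing or non-GPU-resident child.

```python
class Node:
    """Toy stand-in for BlockNode: children keyed by block hash."""
    def __init__(self, has_in_gpu: bool = True):
        self.children = {}
        self.has_in_gpu = has_in_gpu

def block_hash(token_block, prefix_keys):
    """Toy stand-in for get_hash_str: hash the chunk chained on its prefix."""
    return hash((tuple(token_block), tuple(prefix_keys)))

def pre_match_gpu(root, input_ids, block_size):
    """Read-only prefix match; returns (matched token count, last matched node)."""
    last_node, matched, prefix_keys = root, 0, []
    while matched + block_size <= len(input_ids):
        chunk = input_ids[matched:matched + block_size]
        h = block_hash(chunk, prefix_keys)
        prefix_keys = [h]  # chain: next block's hash depends on this one
        child = last_node.children.get(h)
        if child is None or not child.has_in_gpu:
            break
        matched += block_size
        last_node = child
    return matched, last_node

# Build a two-block chain where the second block is not GPU-resident.
ids = list(range(8))
root = Node()
h1 = block_hash(ids[0:4], [])
gpu_node = Node(has_in_gpu=True)
root.children[h1] = gpu_node
h2 = block_hash(ids[4:8], [h1])
gpu_node.children[h2] = Node(has_in_gpu=False)

matched, last = pre_match_gpu(root, ids, block_size=4)
# Matching stops at the first non-GPU node: only the first block counts.
```

A test built this way can also check that a trailing partial chunk is never matched, mirroring the `len(token_block) != block_size` break in the real method.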
Motivation
Pre-match the GPU cache so that a request only needs enough blocks for the tokens that missed the cache, lowering the scheduling threshold for long multi-round requests.
Modifications
- Add pre_match_block_on_gpu.
- Adjust the check before the call to get_prefix_cached_blocks.
Usage or Command
Accuracy Tests
Checklist
- Add a PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.