[Scheduler] Pre match radix tree in schedule#6989
juncaipeng wants to merge 3 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
Pull request overview
This PR adds a read-only pre-match pass over the GPU radix tree to the V1 scheduling flow, so that before calling get_prefix_cached_blocks() the scheduler can check whether enough GPU blocks are available to cover the unmatched tokens. This lowers the resource threshold that hierarchical-cache matching imposes on long, multi-round requests.
Changes:
- Add `PrefixCacheManager.pre_match_block_on_gpu()`: a read-only traversal of the radix tree that counts prefix-hit tokens resident on GPU.
- In `ResourceManagerV1.schedule()` and `preallocate_resource_in_p()`, use the pre-match result to compute `need_block_num` before the `can_allocate_gpu_blocks()` check.
- Adjust the CPU-cache preparation branch in `request_match_blocks()` to avoid a pointless `can_allocate` check on 0 blocks.
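The budget check introduced here boils down to ceil-dividing the tokens that missed the GPU prefix cache by the block size. A minimal sketch of that arithmetic (the function name is hypothetical; the PR inlines this expression in `schedule()`):

```python
def need_block_num(need_prefill_tokens: int, match_token_num: int, block_size: int) -> int:
    """Blocks required to cover the tokens that missed the GPU prefix cache."""
    remaining = need_prefill_tokens - match_token_num
    # ceil division: round up to whole blocks
    return (remaining + block_size - 1) // block_size
```

With a 512-token prefix hit out of 1000 prefill tokens and 64-token blocks, only 8 blocks are needed instead of 16 for the full request.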
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/engine/sched/resource_manager_v1.py | Adds a block-budget check based on GPU prefix pre-matching before scheduling/preallocation, reducing the scheduling threshold and deadlock risk introduced by hierarchical cache matching. |
| fastdeploy/cache_manager/prefix_cache_manager.py | Adds the GPU-only pre-match method and adjusts the CPU cache allocation check. |
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff           @@
##           develop    #6989  +/-  ##
=========================================
  Coverage         ?   73.17%
=========================================
  Files            ?      399
  Lines            ?    56090
  Branches         ?     8855
=========================================
  Hits             ?    41044
  Misses           ?    12141
  Partials         ?     2905
```
```python
match_token_num = 0
mm_idx = 0
prefix_block_key = []
block_size = self.config.cache_config.block_size
```
pre_match_block_on_gpu reads self.config.cache_config.block_size directly, while the scheduler side already obtains block_size explicitly and uses it to compute need_block_num. To avoid future mismatch between matching and estimation if the two block_size sources diverge, consider having this method take a block_size parameter (consistent with the existing mm_match_block(..., block_size) / request_match_blocks(..., block_size)) and use the passed-in value internally.

Suggested change:

```diff
-        block_size = self.config.cache_config.block_size
+        # Prefer block_size from request (set by scheduler) to avoid mismatch
+        # with need_block_num calculation; fall back to config for backward compatibility.
+        block_size = getattr(request, "block_size", None) or self.config.cache_config.block_size
```
```diff
 match_token_num, _ = self.cache_manager.pre_match_block_on_gpu(request)
 need_prefill_tokens = request.need_prefill_tokens - match_token_num
 need_block_num = (need_prefill_tokens + block_size - 1) // block_size
 if not self.cache_manager.can_allocate_gpu_blocks(
-    (request.need_prefill_tokens + self.config.cache_config.block_size - 1)
-    // self.config.cache_config.block_size
-):  # to prevent block allocation for matching in hierarchical cache and cause dead lock
+    need_block_num + running_req_reserved_block_num
+):
+    # to prevent block allocation for matching in hierarchical cache and cause dead lock
```
This introduces a new scheduling-threshold computation based on pre_match_block_on_gpu (match_token_num / need_block_num + running_req_reserved_block_num), which changes both the schedulability of WAITING requests and the break condition when hierarchical cache is enabled. The repository already has unit tests for ResourceManagerV1.schedule() (e.g. tests/v1/test_schedule_output.py); consider adding coverage for GPU prefix hit/miss, insufficient running reserved blocks, and whether requests are scheduled or skipped as expected with hierarchical cache enabled, to avoid regressions.
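A toy model of the gating condition those tests would exercise (all names here are hypothetical stand-ins; the real check lives inline in ResourceManagerV1.schedule()):

```python
def can_schedule(need_prefill_tokens: int, match_token_num: int, block_size: int,
                 free_gpu_blocks: int, running_req_reserved_block_num: int) -> bool:
    """Toy gating check: blocks needed for cache-missed tokens,
    plus blocks reserved for already-running requests, must fit
    in the free GPU block budget."""
    need_block_num = (need_prefill_tokens - match_token_num + block_size - 1) // block_size
    return free_gpu_blocks >= need_block_num + running_req_reserved_block_num
```

A full prefix hit makes the request schedulable with only the reserved-block margin free, while a full miss demands the whole prefill budget up front.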
```python
def pre_match_block_on_gpu(self, request):
    """
    Pre-match request tokens against cached GPU blocks in the radix tree.

    This method performs a prefix matching operation to find the longest sequence
    of tokens that already exist in GPU cache blocks. It traverses the radix tree
    from the root, computing hash values for each block-sized chunk of tokens
    and checking if corresponding nodes exist with GPU-resident data.

    Args:
        request: The inference request object containing prompt_token_ids and
            output_token_ids to be matched against the cache.

    Returns:
        tuple: A tuple containing:
            - match_token_num (int): The total number of tokens that were
              successfully matched in GPU-resident blocks.
            - last_node (BlockNode): The last matched node in the radix tree,
              which represents the deepest point of prefix cache hit.

    Note:
        - Only blocks with `has_in_gpu=True` are considered as valid matches.
        - The matching stops at the first mismatch or when a block is not in GPU.
        - This is a read-only operation that does not modify the radix tree
          or LRU data structures.
    """
    if isinstance(request.prompt_token_ids, np.ndarray):
        prompt_token_ids = request.prompt_token_ids.tolist()
    else:
        prompt_token_ids = request.prompt_token_ids
    input_ids = prompt_token_ids + request.output_token_ids
    total_token_num = len(input_ids)

    last_node = self.radix_tree_root
    match_token_num = 0
    mm_idx = 0
    prefix_block_key = []
    block_size = self.config.cache_config.block_size

    with self.cache_status_lock:
        while match_token_num < total_token_num:
            token_block = input_ids[match_token_num : match_token_num + block_size]
            if len(token_block) != block_size:
                break

            mm_idx, extra_keys = self.get_block_hash_extra_keys(
                request=request,
                start_idx=match_token_num,
                end_idx=match_token_num + block_size,
                mm_idx=mm_idx,
            )
            prefix_block_key.extend(extra_keys)
            hash_value = get_hash_str(token_block, prefix_block_key)
            prefix_block_key = [hash_value]

            if hash_value not in last_node.children:
                break

            child = last_node.children[hash_value]
            if not child.has_in_gpu:
                break
            match_token_num += block_size
            last_node = child

    logger.info(f"pre_match_block_on_gpu: req_id {request.request_id}, match_token_num {match_token_num}")
    return (
        match_token_num,
        last_node,
    )
```
pre_match_block_on_gpu is a newly introduced cache-matching entry point, but its behavior is not directly covered by any existing test in the repository (for example: only GPU-resident nodes match, matching stops at the first non-GPU node, and multimodal extra_keys participate in the hash consistently). Since this method affects scheduling decisions, consider adding unit tests for it in tests/cache_manager/test_prefix_cache_manager.py or tests/v1/cache_manager/test_prefix_cache.py, to ensure its hash/traversal semantics stay consistent with the existing mm_match_block and to guard against drift in future refactors.
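A self-contained sketch of the traversal semantics such a test would pin down, using toy stand-ins for BlockNode and get_hash_str (all names here are hypothetical): matching walks block-sized chunks, chains each block's hash on the previous one, and stops at the first missing or non-GPU-resident child.

```python
class Node:
    """Toy stand-in for BlockNode: children keyed by block hash."""
    def __init__(self, has_in_gpu: bool = True):
        self.children = {}
        self.has_in_gpu = has_in_gpu

def block_hash(token_block, prefix_keys):
    """Toy stand-in for get_hash_str: hash the chunk chained on its prefix."""
    return hash((tuple(token_block), tuple(prefix_keys)))

def pre_match_gpu(root, input_ids, block_size):
    """Read-only prefix match; returns (matched token count, last matched node)."""
    last_node, matched, prefix_keys = root, 0, []
    while matched + block_size <= len(input_ids):
        chunk = input_ids[matched:matched + block_size]
        h = block_hash(chunk, prefix_keys)
        prefix_keys = [h]  # chain: next block's hash depends on this one
        child = last_node.children.get(h)
        if child is None or not child.has_in_gpu:
            break
        matched += block_size
        last_node = child
    return matched, last_node

# Build a two-block chain where the second block is not GPU-resident.
ids = list(range(8))
root = Node()
h1 = block_hash(ids[0:4], [])
gpu_node = Node(has_in_gpu=True)
root.children[h1] = gpu_node
h2 = block_hash(ids[4:8], [h1])
gpu_node.children[h2] = Node(has_in_gpu=False)

matched, last = pre_match_gpu(root, ids, block_size=4)
# Matching stops at the first non-GPU node: only the first block counts.
```

A test built this way can also check that a trailing partial chunk is never matched, mirroring the `len(token_block) != block_size` break in the real method.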
Motivation
Pre-match the GPU cache so that a request only needs enough blocks for the tokens that missed the cache, lowering the scheduling threshold for long multi-round requests.
Modifications
- Add pre_match_block_on_gpu.
- Adjust the check before the call to get_prefix_cached_blocks.
Usage or Command
Accuracy Tests
Checklist
- Add a PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.