DD 7.3: Batch drain relocationComplete to prevent fetchKeysComplete OOM #12993
Open
saintstack wants to merge 6 commits into apple:release-7.3 from
added 2 commits
April 14, 2026 16:35
The DDQueue choose loop processes one event per iteration. When dataTransferComplete and other events are frequent, relocationComplete processing is starved, and fetchKeysComplete entries (erased only in the relocationComplete handler) accumulate without bound. On p67 this reached 33,593 entries, causing OOM.

Extract completion processing into DDQueue::processRelocationComplete() and batch-drain all ready relocationComplete events after the first waitNext, capped at 1000 per iteration to avoid hogging the event loop. This keeps fetchKeysComplete bounded regardless of event interleaving.
The done variable from waitNext is const RelocateData. Use a separate non-const variable for the drain loop.
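The bounded batch-drain described in these two commits can be sketched in plain C++ without Flow. This is a minimal model under stated assumptions, not the PR's code: `QueueModel`, `Relocation`, `readyCompletions`, `pendingFetches`, `processCompletion`, and `handleCompletions` are hypothetical names, a `std::deque` stands in for the relocationComplete stream, and a `std::set` stands in for fetchKeysComplete.

```cpp
#include <cstddef>
#include <deque>
#include <set>

// Hypothetical stand-in for a relocation record; only an id is modeled.
struct Relocation {
    int id;
};

// Hypothetical model of the queue state discussed above (not FoundationDB code).
struct QueueModel {
    std::deque<Relocation> readyCompletions; // models ready relocationComplete events
    std::set<int> pendingFetches;            // models fetchKeysComplete entries

    // Models the bookkeeping extracted into a helper: erase the pending entry.
    void processCompletion(const Relocation& r) { pendingFetches.erase(r.id); }

    // Handle the completion that woke the loop, then drain further ready
    // completions, at most maxDrain extra per call. Returns the number handled.
    std::size_t handleCompletions(std::size_t maxDrain) {
        if (readyCompletions.empty()) {
            return 0;
        }
        // The first event is held in a const variable, as with waitNext.
        const Relocation first = readyCompletions.front();
        readyCompletions.pop_front();
        processCompletion(first);
        std::size_t handled = 1;

        std::size_t drained = 0;
        while (!readyCompletions.empty() && drained++ < maxDrain) {
            // A separate non-const variable is used for each drained event.
            Relocation next = readyCompletions.front();
            readyCompletions.pop_front();
            processCompletion(next);
            ++handled;
        }
        return handled;
    }
};
```

In this model, a call with 10 ready completions and `maxDrain` of 3 handles 4 events and leaves 6 pending entries for later iterations, so other event sources still get a turn between calls.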
Contributor
Pull request overview
This PR addresses a DDQueue event-loop starvation issue where relocationComplete processing can be delayed by other frequent events (e.g. dataTransferComplete), allowing fetchKeysComplete to grow without bound and potentially cause OOM. It refactors relocation-completion handling into a helper and batch-drains ready completion events to keep fetchKeysComplete bounded.
Changes:
- Extracted relocation completion bookkeeping into DDQueue::processRelocationComplete().
- Added a bounded batch-drain loop to process additional ready relocationComplete events (up to a fixed cap) in one choose-branch.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fdbserver/include/fdbserver/DDRelocationQueue.h | Declares the new processRelocationComplete() helper on DDQueue. |
| fdbserver/DDRelocationQueue.actor.cpp | Implements processRelocationComplete() and adds a bounded drain loop to reduce starvation/backlog. |
Contributor
Result of foundationdb-pr-73 on Linux RHEL 9
added 4 commits
April 14, 2026 17:40
Avoid creating temporary FutureStream objects in the drain loop. getFuture() adds/removes ref counts on each call which may interact poorly with the actor compiler.
Using getFuture().pop() on temporary FutureStream objects caused a hang. Store the FutureStream once as a state variable and use it for both waitNext in the choose block and isReady()/pop() in the drain loop.
Each noErrorActors.add(tag(delay(0)...)) call in the drain loop schedules an immediate task. With many completions drained at once, this floods the task queue and causes the event loop to spend all its time on system monitor callbacks (getResidentMemoryUsage) instead of making progress. Keep the delay(0) for the first completion (the original behavior) but skip it for batch-drained completions. The key cleanup (fetchKeysComplete erase, activeRelocations decrement) still happens immediately.
Use post-increment (drained++ < 1000) to drain up to 1000 completions instead of 999.
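The off-by-one in the last commit can be illustrated with two small counting functions. This is a sketch that assumes the earlier guard used pre-increment (`++drained < 1000`), which the commit message implies but does not show; `countPostIncrement` and `countPreIncrement` are hypothetical helpers, not part of the PR.

```cpp
#include <cstddef>

// Counts loop-body executions when the guard is `drained++ < cap`.
// The comparison sees the value before the increment, so values 0..cap-1
// all pass and the body can run up to cap times.
std::size_t countPostIncrement(std::size_t available, std::size_t cap) {
    std::size_t drained = 0;
    std::size_t processed = 0;
    while (processed < available && drained++ < cap) {
        ++processed;
    }
    return processed;
}

// Counts loop-body executions when the guard is `++drained < cap`.
// The comparison sees the value after the increment, so only values
// 1..cap-1 pass and the body runs at most cap - 1 times.
std::size_t countPreIncrement(std::size_t available, std::size_t cap) {
    std::size_t drained = 0;
    std::size_t processed = 0;
    while (processed < available && ++drained < cap) {
        ++processed;
    }
    return processed;
}
```

With a cap of 1000 and more than 1000 items available, the post-increment form processes 1000 items and the pre-increment form processes 999. When fewer items are available than the cap, both forms process all of them.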
Contributor
Result of foundationdb-pr-73 on Linux RHEL 9
20260415-043932-stack_centos7_all_starvation-db5246251777f6a compressed=True data_size=51146319 duration=4673620 ended=99999 fail=1 fail_fast=10 max_runs=100000 pass=99998 priority=100 remaining=0:00:00 runtime=0:52:07 sanity=False started=100000 submitted=20260415-043932 timeout=5400 username=stack_centos7_all_starvation

The failure was RandomSeed="4014465418" SourceVersion="5bfd01720280600535c8bedf5c4bd2cbd4da453d" Time="1776228920" BuggifyEnabled="1" DeterminismCheck="0" FaultInjectionEnabled="1" TestFile="tests/fast/GetMappedRange.toml"