
DD 7.3: Batch drain relocationComplete to prevent fetchKeysComplete OOM #12993

Open

saintstack wants to merge 6 commits into apple:release-7.3 from saintstack:dd_7.3_starvation

Conversation

@saintstack (Contributor) commented Apr 14, 2026

The DDQueue choose loop processes one event per iteration. When
dataTransferComplete and other events are frequent, relocationComplete
processing is starved: fetchKeysComplete entries (erased only in the
relocationComplete handler) accumulate without bound. On one cluster this
reached 33,593 entries, causing an OOM.

Extract completion processing into DDQueue::processRelocationComplete()
and batch-drain all ready relocationComplete events after the first
waitNext, capped at 1000 per iteration to avoid hogging the event loop.
This keeps fetchKeysComplete bounded regardless of event interleaving.

20260415-043932-stack_centos7_all_starvation-db5246251777f6a compressed=True data_size=51146319 duration=4673620 ended=99999 fail=1 fail_fast=10 max_runs=100000 pass=99998 priority=100 remaining=0:00:00 runtime=0:52:07 sanity=False started=100000 submitted=20260415-043932 timeout=5400 username=stack_centos7_all_starvation

The failure was RandomSeed="4014465418" SourceVersion="5bfd01720280600535c8bedf5c4bd2cbd4da453d" Time="1776228920" BuggifyEnabled="1" DeterminismCheck="0" FaultInjectionEnabled="1" TestFile="tests/fast/GetMappedRange.toml"

michael stack added 2 commits April 14, 2026 16:35
The DDQueue choose loop processes one event per iteration. When
dataTransferComplete and other events are frequent, relocationComplete
processing is starved — fetchKeysComplete entries (erased only in the
relocationComplete handler) accumulate unboundedly. On p67 this reached
33,593 entries causing OOM.

Extract completion processing into DDQueue::processRelocationComplete()
and batch-drain all ready relocationComplete events after the first
waitNext, capped at 1000 per iteration to avoid hogging the event loop.
This keeps fetchKeysComplete bounded regardless of event interleaving.

The done variable from waitNext is const RelocateData. Use a separate
non-const variable for the drain loop.
Copilot AI (Contributor) left a comment


Pull request overview

This PR addresses a DDQueue event-loop starvation issue where relocationComplete processing can be delayed by other frequent events (e.g. dataTransferComplete), allowing fetchKeysComplete to grow without bound and potentially cause OOM. It refactors relocation-completion handling into a helper and batch-drains ready completion events to keep fetchKeysComplete bounded.

Changes:

  • Extracted relocation completion bookkeeping into DDQueue::processRelocationComplete().
  • Added a bounded batch-drain loop to process additional ready relocationComplete events (up to a fixed cap) in one choose-branch.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

  • fdbserver/include/fdbserver/DDRelocationQueue.h: Declares the new processRelocationComplete() helper on DDQueue.
  • fdbserver/DDRelocationQueue.actor.cpp: Implements processRelocationComplete() and adds a bounded drain loop to reduce starvation/backlog.


  • Comment thread: fdbserver/DDRelocationQueue.actor.cpp (Outdated)
  • Comment thread: fdbserver/DDRelocationQueue.actor.cpp (Outdated)
@foundationdb-ci (Contributor)

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 0b7a802
  • Duration 0:39:47
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

2 similar comments

michael stack added 4 commits April 14, 2026 17:40
Avoid creating temporary FutureStream objects in the drain loop.
getFuture() adds/removes ref counts on each call which may interact
poorly with the actor compiler.

Using getFuture().pop() on temporary FutureStream objects caused a hang.
Store the FutureStream once as a state variable and use it for both
waitNext in the choose block and isReady()/pop() in the drain loop.

Each noErrorActors.add(tag(delay(0)...)) in the drain loop schedules
an immediate task. With many completions drained at once, this floods
the task queue and causes the event loop to spend all time on system
monitor callbacks (getResidentMemoryUsage) instead of making progress.

Keep the delay(0) for the first completion (original behavior) but skip
it for batch-drained completions. The key cleanup (fetchKeysComplete
erase, activeRelocations decrement) still happens immediately.

Use post-increment (drained++ < 1000) to drain up to 1000 completions
instead of 999.
@foundationdb-ci (Contributor)

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 3fe923d
  • Duration 0:55:54
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

2 similar comments



3 participants