DAOS-0000 rebuild: avoid abort on rank SUSPECT-to-ALIVE swim state ch…#17749

Draft
wangshilong wants to merge 1 commit into master from shilongw/abort_rebuild

@wangshilong wangshilong commented Mar 22, 2026

…ange

In a large cluster, a rank may temporarily enter SUSPECT state and recover back to ALIVE due to network jitter, without actually restarting. Aborting rebuild in this case is incorrect.

Instead, rely on the rebuild IV heartbeat: if the PS leader has not received an IV update from a rank for more than 10 minutes, the rank has likely restarted or is dead.
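
A minimal sketch of the idea described above, not the actual patch: the PS leader records when each rank last sent a rebuild IV update, ignores a SUSPECT-to-ALIVE flap, and only presumes a rank restarted or dead once it has been silent for more than 10 minutes. Every name here (`rank_tracker`, `rank_iv_update`, `rank_swim_change`, `rank_heartbeat_expired`, `IV_HEARTBEAT_TIMEOUT_SEC`) is hypothetical and chosen for illustration; none is taken from the DAOS source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All identifiers below are hypothetical; they are not DAOS APIs. */

/* 10 minutes, the threshold stated in the PR description. */
#define IV_HEARTBEAT_TIMEOUT_SEC (10 * 60)

enum rank_swim_state { RANK_ALIVE, RANK_SUSPECT };

struct rank_tracker {
	enum rank_swim_state state;              /* last SWIM state seen for the rank */
	uint64_t             last_iv_update_sec; /* time of the last rebuild IV update */
};

/* Record that the leader received a rebuild IV update from the rank. */
static void
rank_iv_update(struct rank_tracker *rt, uint64_t now_sec)
{
	assert(rt != NULL);
	rt->last_iv_update_sec = now_sec;
}

/*
 * Record a SWIM state change. A SUSPECT-to-ALIVE transition only updates
 * the stored state; it does not by itself mark the rank as failed.
 */
static void
rank_swim_change(struct rank_tracker *rt, enum rank_swim_state new_state)
{
	assert(rt != NULL);
	rt->state = new_state;
}

/*
 * Return true if the rank has sent no IV update for more than the timeout,
 * i.e. the leader should presume it restarted or is dead.
 */
static bool
rank_heartbeat_expired(const struct rank_tracker *rt, uint64_t now_sec)
{
	assert(rt != NULL);
	/* Guard against clock values earlier than the last update. */
	if (now_sec <= rt->last_iv_update_sec)
		return false;
	return now_sec - rt->last_iv_update_sec > IV_HEARTBEAT_TIMEOUT_SEC;
}
```

In this sketch the decision depends only on IV silence, so a rank that flaps through SUSPECT because of network jitter but keeps reporting never trips the timeout.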

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@github-actions
Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-0000
