HDDS-14868. Avoid full scan of container list during refreshAndValidate of ContainerSafemodeRule. #9953

Draft
sadanand48 wants to merge 11 commits into apache:master from sadanand48:HDDS-14868

@sadanand48
Contributor

@sadanand48 sadanand48 commented Mar 20, 2026

What changes were proposed in this pull request?

Periodic refresh: run the safemode rule refresh on a configurable schedule (~5s) instead of on every applyTransaction / refresh(false) path.

https://issues.apache.org/jira/browse/HDDS-14868
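
As a rough illustration of the periodic approach described above, here is a minimal sketch using only the JDK's ScheduledExecutorService. It is not the code in this PR: the class name, the `startPeriodicRefresh` helper, and the `Runnable` standing in for the safemode rule refresh are all assumptions made for the example, and the interval is shortened so the demo finishes quickly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PeriodicRefreshSketch {

  /**
   * Runs the given refresh task on a fixed delay.
   * The Runnable is a hypothetical stand-in for the safemode rule refresh.
   */
  static ScheduledExecutorService startPeriodicRefresh(Runnable refresh, long intervalMs) {
    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor(r -> {
          Thread t = new Thread(r, "safemode-refresh");
          t.setDaemon(true); // do not keep the JVM alive just for refreshing
          return t;
        });
    // Fixed delay: the next run is scheduled only after the previous one ends,
    // so a slow refresh never overlaps with the following one.
    executor.scheduleWithFixedDelay(refresh, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    return executor;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger refreshCount = new AtomicInteger();
    // 50 ms instead of ~5 s, purely to keep the demo short.
    ScheduledExecutorService executor =
        startPeriodicRefresh(refreshCount::incrementAndGet, 50);
    Thread.sleep(300);
    executor.shutdownNow();
    System.out.println("refreshes: " + refreshCount.get());
  }
}
```

With this shape, the refresh cost depends only on the configured interval and not on how many transactions are applied.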

@szetszwo
Contributor

@sadanand48 , thanks for working on this!

How about refreshing the safemode rules every 5s, instead of doing it in applyTransactions?

@sadanand48
Contributor Author

sadanand48 commented Mar 26, 2026

> How about refreshing the safemode rules every 5s, instead of doing it in applyTransactions?

Thanks @szetszwo for the input. We could make this behaviour configurable, i.e. either periodic or based on applyTransaction. I'm suggesting this because smaller clusters, or clusters without any pending logs, may be impacted by redundant refresh calls.

@szetszwo
Contributor

> ... smaller clusters or clusters without any pending logs may be impacted by redundant refresh calls.

Refreshing the safemode rules in applyTransaction is actually a big mistake: applyTransaction is on the critical path of the StateMachine, so adding unnecessary operations there is going to slow down everything.

In contrast, refreshing the safemode rules every 5s is not going to have any measurable performance impact. Hypothetically, if refreshing every 5s is not okay, then refreshing in applyTransaction is definitely much worse, since there are thousands of applyTransaction ops per second.
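
To put rough numbers on the comparison above, here is a small worked calculation. The figure of 1000 applyTransaction ops per second is an assumed illustrative value (the low end of "thousands"), and the count for the transaction path is an upper bound that assumes every applyTransaction triggers one refresh. The class and method names are hypothetical.

```java
public class RefreshRateComparison {

  /** Upper bound on refreshes if every applied transaction triggers one. */
  static long refreshesOnTransactionPath(long txPerSecond, long windowSeconds) {
    return txPerSecond * windowSeconds;
  }

  /** Number of refreshes when one runs every intervalSeconds. */
  static long refreshesOnSchedule(long windowSeconds, long intervalSeconds) {
    return windowSeconds / intervalSeconds;
  }

  public static void main(String[] args) {
    long txPerSecond = 1000;  // assumed rate, for illustration only
    long windowSeconds = 60;  // compare over one minute
    long intervalSeconds = 5; // the period suggested in this thread

    System.out.println("per-transaction: "
        + refreshesOnTransactionPath(txPerSecond, windowSeconds)); // prints 60000
    System.out.println("periodic: "
        + refreshesOnSchedule(windowSeconds, intervalSeconds));    // prints 12
  }
}
```

Under these assumptions the periodic schedule performs 12 refreshes in a minute, against up to 60000 on the transaction path.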

@sadanand48 sadanand48 requested a review from szetszwo March 27, 2026 07:58
Contributor

@szetszwo szetszwo left a comment

@sadanand48 , thanks for the update

  • Since the current code in SCMStateMachine uses SCMSafeModeManager to refresh, it is better to do the refresh in SCMSafeModeManager.
  • When refresh is enabled, SCMStateMachine should not refresh.
  • During refreshing, if it is NOT in safemode, we can stop the executor. Then, we don't need any stop method.
  • It is better to create a non-mock test using MiniOzoneCluster.

See https://issues.apache.org/jira/secure/attachment/13081501/9953_review.patch
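
The third point above (stop the executor from inside the refresh once safemode has been exited, so that no separate stop method is needed) can be sketched as follows. This is a standalone illustration and not the content of the linked patch: `SelfStoppingRefresher`, the `BooleanSupplier` used as the safemode check, and the counter are hypothetical names introduced for the example.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class SelfStoppingRefresher {
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();
  private final BooleanSupplier inSafeMode; // hypothetical safemode check
  private final AtomicInteger refreshCount = new AtomicInteger();

  SelfStoppingRefresher(BooleanSupplier inSafeMode) {
    this.inSafeMode = inSafeMode;
  }

  void start(long intervalMs) {
    executor.scheduleWithFixedDelay(
        this::refreshOrStop, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  private void refreshOrStop() {
    if (!inSafeMode.getAsBoolean()) {
      // With the default executor policy, shutdown() cancels the remaining
      // periodic runs, so the task retires itself without an external stop().
      executor.shutdown();
      return;
    }
    refreshCount.incrementAndGet(); // stand-in for refreshing the rules
  }

  int refreshCount() {
    return refreshCount.get();
  }

  boolean awaitStopped(long timeoutMs) throws InterruptedException {
    return executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean safeMode = new AtomicBoolean(true);
    SelfStoppingRefresher refresher = new SelfStoppingRefresher(safeMode::get);
    refresher.start(20);
    Thread.sleep(200);
    safeMode.set(false); // simulate leaving safemode
    System.out.println("stopped: " + refresher.awaitStopped(2000));
    System.out.println("refreshes: " + refresher.refreshCount());
  }
}
```

One consequence of this design is that the executor's lifetime is tied to safemode itself, so callers only need to start it.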

@sadanand48
Contributor Author

Thanks @szetszwo for the review. I have updated the PR as per your patch.

> it is better to do refresh in SCMSafeModeManager.

With this, all the safemode rules will have the same behaviour; I guess that should be okay. I will add a non-mock test.

Contributor

@szetszwo szetszwo left a comment

@sadanand48 , thanks for the update!

Quick question:

  • Would it work if we don't make the changes in AbstractContainerSafeModeRule and other code logic changes such as isScmRatisApplyCaughtUpToCommit?

If it works, this PR should only change when the refresh happens (i.e. periodic refreshing instead of doing it in SCMStateMachine). Other code logic changes/improvements can be done in a separate PR.
