HDDS-14868. Avoid full scan of container list during refreshAndValidate of ContainerSafemodeRule.#9953
sadanand48 wants to merge 11 commits into apache:master
@sadanand48, thanks for working on this! How about refreshing the safemode rules every 5s instead of doing it in applyTransaction?
Thanks @szetszwo for the input. We could make this behaviour configurable, i.e., periodic or based on applyTransaction. I'm suggesting this because smaller clusters, or clusters without any pending logs, may be impacted by redundant refresh calls.
Refreshing the safemode rules in applyTransaction is actually a big mistake: applyTransaction is on the critical path of the StateMachine, and adding unnecessary operations there is going to slow down everything. In contrast, refreshing the safemode rules every 5s is not going to have any measurable performance impact. Hypothetically, if refreshing every 5s is not okay, then refreshing in applyTransaction is definitely much worse, since there are thousands of applyTransaction ops per second.
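To make the trade-off above concrete, here is a minimal sketch of moving the refresh off the hot path and onto a scheduler. The class `PeriodicRefreshSketch` and its methods are hypothetical stand-ins written for illustration, not the real SCMStateMachine or SCMSafeModeManager code; only `ScheduledExecutorService` is a real JDK API.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for illustration; not the real SCMStateMachine.
class PeriodicRefreshSketch {
  final AtomicLong applied = new AtomicLong();
  final AtomicLong refreshes = new AtomicLong();

  // Daemon thread so a forgotten stop() cannot keep the JVM alive.
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(task -> {
        Thread t = new Thread(task, "safemode-refresh");
        t.setDaemon(true);
        return t;
      });

  // Hot path: only applies the transaction, never refreshes the rules.
  void applyTransaction() {
    applied.incrementAndGet();
  }

  // Stand-in for re-validating the safemode rules.
  void refreshRules() {
    refreshes.incrementAndGet();
  }

  // Schedules refreshRules() off the hot path with a fixed delay.
  void start(long interval, TimeUnit unit) {
    scheduler.scheduleWithFixedDelay(this::refreshRules, interval, interval, unit);
  }

  void stop() {
    scheduler.shutdownNow();
  }

  public static void main(String[] args) {
    PeriodicRefreshSketch sm = new PeriodicRefreshSketch();
    sm.start(5, TimeUnit.SECONDS);
    for (int i = 0; i < 10_000; i++) {
      sm.applyTransaction();
    }
    System.out.println("applied=" + sm.applied.get());
    sm.stop();
  }
}
```

With this shape, the refresh cost is bounded by the schedule (one call per interval) rather than by the transaction rate.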
szetszwo
left a comment
@sadanand48, thanks for the update.
- Since the current code in SCMStateMachine uses SCMSafeModeManager to refresh, it is better to do the refresh in SCMSafeModeManager.
- When refresh is enabled, SCMStateMachine should not refresh.
- During refreshing, if it is NOT in safemode, we can stop the executor. Then, we don't need any stop method.
- It is better to create a non-mock test using MiniOzoneCluster.
See https://issues.apache.org/jira/secure/attachment/13081501/9953_review.patch
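The first three suggestions above can be sketched roughly as follows. This is an illustration under stated assumptions, not the content of the linked patch: `SelfStoppingRefresher`, `periodicRefresh`, `refreshFromStateMachine`, and the `periodicRefreshEnabled` flag are hypothetical names, and the counter stands in for the actual rule validation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names do not mirror the real SCMSafeModeManager API.
class SelfStoppingRefresher {
  private final boolean periodicRefreshEnabled;
  private final AtomicBoolean inSafeMode = new AtomicBoolean(true);
  private final AtomicInteger refreshCount = new AtomicInteger();
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor(task -> {
        Thread t = new Thread(task, "safemode-refresh");
        t.setDaemon(true);
        return t;
      });

  SelfStoppingRefresher(boolean periodicRefreshEnabled) {
    this.periodicRefreshEnabled = periodicRefreshEnabled;
  }

  // Starts the periodic task only when periodic refresh is enabled.
  void start(long intervalMillis) {
    if (periodicRefreshEnabled) {
      executor.scheduleWithFixedDelay(
          this::periodicRefresh, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }
  }

  // One refresh pass; once safemode is over it shuts the executor down,
  // so no separate stop method is needed.
  void periodicRefresh() {
    if (!inSafeMode.get()) {
      executor.shutdown();
      return;
    }
    refreshCount.incrementAndGet(); // stand-in for re-validating the rules
  }

  // Called from the state machine; a no-op when periodic refresh is enabled.
  void refreshFromStateMachine() {
    if (!periodicRefreshEnabled && inSafeMode.get()) {
      refreshCount.incrementAndGet();
    }
  }

  void exitSafeMode() {
    inSafeMode.set(false);
  }

  boolean isStopped() {
    return executor.isShutdown();
  }

  int getRefreshCount() {
    return refreshCount.get();
  }
}
```

Calling `shutdown()` from inside the scheduled task is safe here because, by default, a `ScheduledThreadPoolExecutor` cancels periodic tasks on shutdown, so the task does not run again after safemode exit.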
Thanks @szetszwo for the review. I have updated the PR as per your patch.
With this, all the safemode rules will have the same behaviour, which I guess should be okay. I will add a non-mock test.
szetszwo
left a comment
@sadanand48, thanks for the update!
Quick question:
- Would it work if we don't make the changes in AbstractContainerSafeModeRule and the other code logic changes such as isScmRatisApplyCaughtUpToCommit?
If it works, this PR should only change the refresh timing (i.e., periodic refreshing instead of doing it in SCMStateMachine). Other code logic changes/improvements can be done in a separate PR.
What changes were proposed in this pull request?
Periodic refresh: run refresh on a ~5s (configurable) schedule instead of on every applyTransaction / refresh(false) path.
https://issues.apache.org/jira/browse/HDDS-14868