Skip to content

fix ProviderData.Update() semantics wrt controller-runtime client cache#2008

Open
multi-io wants to merge 1 commit intokubermatic:mainfrom
multi-io:providerdata-update-semantics
Open

fix ProviderData.Update() semantics wrt controller-runtime client cache#2008
multi-io wants to merge 1 commit intokubermatic:mainfrom
multi-io:providerdata-update-semantics

Conversation

@multi-io
Copy link
Copy Markdown
Contributor

@multi-io multi-io commented Mar 31, 2026

What this PR does / why we need it:

This changes ProviderData.Update() semantics to only modify the in-memory representation on successful updates.

Previously, if the modifier did not change anything, the function would still as a side effect fill the passed-in machine struct with the latest version from the controller-runtime cache, which is only ever updated asynchronously in the background and might be arbitrarily outdated.

So e.g. if you called this function twice in a row, the first call might've performed an update and filled the struct with the latest state of the machine from the cluster, but then the next call might've performed no update and filled the struct with an outdated state, potentially behind even what it was before the two calls.

This could lead to hard to find bugs because callers generally don't expect these semantics and proceed working with the returned outdated struct.

This PR changes the function so the machine struct that the caller passes in is only modified if the update succeeded, in which case the structure is filled with the latest state from the cluster, post-update.

If the modifier didn't change anything or the update failed, the struct is left untouched.

**Example

I haven't checked your providers, but in our fork the Openstack provider reconciles a machine kinda as follows (each Ux represents an update of the machine resource performed using ProviderData.Update()):

  1. (U1) ensure finalizers: clean-up-openstack-server, clean-up-openstack-port
  2. (U2) ensure finalizers: machine-delete-finalizer, machine-node-delete-finalizer
  3. (U3) ensure finalizer: clean-up-openstack-port
  4. create port named metakube-machine-<UID> in Openstack
  5. (U4) set port ID annotation on the machine
  6. (U5) ensure finalizer: clean-up-openstack-server
  7. create the server in Openstack, with the port set to the ID from the port ID annotation
  8. (U6) set server ID annotation on the machine

(1., 3. and 6. ensure the same finalizers because of a technicality -- we introduced these finalizers a while ago, and older machines didn't have them yet. Anyway, while redundant, this shouldn't be a problem, but it is -- see below)

The update proceeds through the port creation in step 4, it successfully updates the machine writing the port ID annotation in step 5/U4.

Then it gets to U5, where it tries to ensure the Openstack server cleanup finalizer. That finalizer already exists on the machine, usually from U1 in the same reconcile run. The cache happens to already contain U1, but not U4. So the update routine detects no diff (since the finalizer is already in the cached version) and thus it does not send an update request and changes the in-memory machine struct to the cached version, i.e. one where the port ID annotation is still "".

Then the controller proceeds to the Openstack server creation (step 7), which requires the port ID, which it gets from the machine struct, where it's set to "". So it calls the Openstack server creation with port ID "", which triggers an error (Bad network format: missing 'uuid'") in the Openstack API.

We had been seeing this error pop up randomly in our logs and in machine events (so users also reported them) and never understood where they came from until we dug into this update routine. Please note that we were involuntarily relying on the Openstack API to perform this error handling here. In other cases or other providers, this might actually change or corrupt data in unwanted ways.

There are probably alternative ways to handle this, like disabling the cache, or fixing all the call sites of the update routine, or maybe waiting for the cache to contain a particular state in certain situations. Maybe you've done some of those things, in which case this PR is not needed.

Which issue(s) this PR fixes:

Fixes #

What type of PR is this?

/kind bug

Special notes for your reviewer:

Does this PR introduce a user-facing change? Then add your Release Note here:


Documentation:

NONE

…epresentation on successful updates

Previously, if the modifier did not change anything, the function would
still as a side effect fill the passed-in machine struct with the latest
version from the controller-runtime cache, which might be arbitrarily
outdated.

So e.g. if you called this function twice in a row, the first call
might've performed an update and filled the struct with the latest state
of the machine from the cluster, but then the next call might've
performed no update and filled the struct with an outdated state,
potentially behind even what it was before the two calls.

This could lead to hard to find bugs because callers generally don't
expect these semantics and proceed working with the returned outdated
struct.

This changes the function so the machine struct that the caller passes
in is only modified if the update succeded, in which case the structure
is filled with the latest state from the cluster, post-update.

If the modifier didn't change anything or the update failed, the struct
is left untouched.

Signed-off-by: Olaf Klischat <olaf.klischat@gmail.com>
@kubermatic-bot kubermatic-bot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 31, 2026
@kubermatic-bot
Copy link
Copy Markdown
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubermatic-bot kubermatic-bot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. dco-signoff: yes Denotes that all commits in the pull request have the valid DCO signoff message. do-not-merge/docs-needed Indicates that a PR should not merge because it's missing one of the documentation labels. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. labels Mar 31, 2026
@kubermatic-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign buraksekili for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubermatic-bot kubermatic-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 31, 2026
@kubermatic-bot
Copy link
Copy Markdown
Contributor

Hi @multi-io. Thanks for your PR.

I'm waiting for a kubermatic member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubermatic-bot kubermatic-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 31, 2026
@kron4eg kron4eg requested a review from Copilot March 31, 2026 23:51
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts ProviderData.Update() / GetMachineUpdater() behavior so the caller-provided *Machine is only mutated when an actual update succeeds, avoiding surprising regressions caused by controller-runtime’s (potentially stale) cache on no-op updates.

Changes:

  • Fetches into a separate cachedMachine instead of overwriting the caller’s machine early.
  • Applies modifiers to cachedMachine, performs Update() only when there’s a real diff, and copies back into machine only after a successful update.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 107 to +112
// Check if we actually change something and only update if that is the case.
unmodifiedMachine := machine.DeepCopy()
unmodifiedMachine := cachedMachine.DeepCopy()
for _, modify := range modifiers {
modify(machine)
modify(cachedMachine)
}
if equality.Semantic.DeepEqual(unmodifiedMachine, machine) {
if equality.Semantic.DeepEqual(unmodifiedMachine, cachedMachine) {
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change also alters modifier execution semantics: modifiers now run against cachedMachine, while the caller-provided machine remains stale until after a successful Update. Any modifier that closes over and reads machine (instead of only using the updatedMachine argument) will now make decisions on stale data and may introduce unintended diffs (e.g. pkg/cloudprovider/provider/azure/provider.go:769 checks HasFinalizer(machine, ...) inside the modifier). Consider preserving the old “modifiers see the freshly-Get()'d object” behavior by copying cachedMachine into machine before invoking modifiers and restoring the original machine value on no-op / error, or otherwise updating affected call sites and documenting that modifiers must only use their argument.

Copilot uses AI. Check for mistakes.
@@ -99,20 +99,27 @@ func GetMachineUpdater(ctx context.Context, client ctrlruntimeclient.Client) Mac
// Store name here, because the machine can be nil if an update failed.
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the passed-in machine can be nil, but this function dereferences machine immediately (Namespace/Name) and later calls DeepCopyInto(machine), which would panic if it were nil. Either adjust the comment to reflect the actual contract (non-nil required) or add an explicit nil check that returns a clear error.

Suggested change
// Store name here, because the machine can be nil if an update failed.
if machine == nil {
return fmt.Errorf("machine must not be nil when applying modifiers")
}
// Store name here for use in the retry loop; machine is required to be non-nil.

Copilot uses AI. Check for mistakes.
Comment on lines +112 to 114
if equality.Semantic.DeepEqual(unmodifiedMachine, cachedMachine) {
return nil
}
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a subtle behavior change (caller object is no longer refreshed on no-op, and only updated on successful writes). Please add targeted unit tests for GetMachineUpdater to lock in the new semantics (no mutation on no-op / on update error; mutation on successful update) and to prevent regressions in conflict-retry behavior.

Copilot uses AI. Check for mistakes.
@kubermatic-bot kubermatic-bot added docs/none Denotes a PR that doesn't need documentation (changes). and removed do-not-merge/docs-needed Indicates that a PR should not merge because it's missing one of the documentation labels. labels Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Denotes that all commits in the pull request have the valid DCO signoff message. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. docs/none Denotes a PR that doesn't need documentation (changes). kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants