fix ProviderData.Update() semantics wrt controller-runtime client cache by multi-io · Pull Request #2008 · kubermatic/machine-controller

multi-io · 2026-03-31T22:33:11Z

What this PR does / why we need it:

This changes ProviderData.Update() semantics to only modify the in-memory representation on successful updates.

Previously, if the modifier did not change anything, the function would still as a side effect fill the passed-in machine struct with the latest version from the controller-runtime cache, which is only ever updated asynchronously in the background and might be arbitrarily outdated.

So e.g. if you called this function twice in a row, the first call might've performed an update and filled the struct with the latest state of the machine from the cluster, but then the next call might've performed no update and filled the struct with an outdated state, potentially behind even what it was before the two calls.

This could lead to hard to find bugs because callers generally don't expect these semantics and proceed working with the returned outdated struct.

This PR changes the function so the machine struct that the caller passes in is only modified if the update succeeded, in which case the structure is filled with the latest state from the cluster, post-update.

If the modifier didn't change anything or the update failed, the struct is left untouched.

**Example

I haven't checked your providers, but in our fork the Openstack provider reconciles a machine kinda as follows (each Ux represents an update of the machine resource performed using ProviderData.Update()):

(U1) ensure finalizers: clean-up-openstack-server, clean-up-openstack-port
(U2) ensure finalizers: machine-delete-finalizer, machine-node-delete-finalizer
(U3) ensure finalizer: clean-up-openstack-port
create port named metakube-machine-<UID> in Openstack
(U4) set port ID annotation on the machine
(U5) ensure finalizer: clean-up-openstack-server
create the server in Openstack, with the port set to the ID from the port ID annotation
(U6) set server ID annotation on the machine

(1., 3. and 6. ensure the same finalizers because of a technicality -- we introduced these finalizers a while ago, and older machines didn't have them yet. Anyway, while redundant, this shouldn't be a problem, but it is -- see below)

The update proceeds through the port creation in step 4, it successfully updates the machine writing the port ID annotation in step 5/U4.

Then it gets to U5, where it tries to ensure the Openstack server cleanup finalizer. That finalizer already exists on the machine, usually from U1 in the same reconcile run. The cache happens to already contain U1, but not U4. So the update routine detects no diff (since the finalizer is already in the cached version) and thus it does not send an update request and changes the in-memory machine struct to the cached version, i.e. one where the port ID annotation is still "".

Then the controller proceeds to the Openstack server creation (step 7), which requires the port ID, which it gets from the machine struct, where it's set to "". So it calls the Openstack server creation with port ID "", which triggers an error (Bad network format: missing 'uuid'") in the Openstack API.

We had been seeing this error pop up randomly in our logs and in machine events (so users also reported them) and never understood where they came from until we dug into this update routine. Please note that we were involuntarily relying on the Openstack API to perform this error handling here. In other cases or other providers, this might actually change or corrupt data in unwanted ways.

There are probably alternative ways to handle this, like disabling the cache, or fixing all the call sites of the update routine, or maybe waiting for the cache to contain a particular state in certain situations. Maybe you've done some of those things, in which case this PR is not needed.

Which issue(s) this PR fixes:

Fixes #

What type of PR is this?

/kind bug

Special notes for your reviewer:

Does this PR introduce a user-facing change? Then add your Release Note here:

Documentation:

NONE

…epresentation on successful updates Previously, if the modifier did not change anything, the function would still as a side effect fill the passed-in machine struct with the latest version from the controller-runtime cache, which might be arbitrarily outdated. So e.g. if you called this function twice in a row, the first call might've performed an update and filled the struct with the latest state of the machine from the cluster, but then the next call might've performed no update and filled the struct with an outdated state, potentially behind even what it was before the two calls. This could lead to hard to find bugs because callers generally don't expect these semantics and proceed working with the returned outdated struct. This changes the function so the machine struct that the caller passes in is only modified if the update succeded, in which case the structure is filled with the latest state from the cluster, post-update. If the modifier didn't change anything or the update failed, the struct is left untouched. Signed-off-by: Olaf Klischat <olaf.klischat@gmail.com>

kubermatic-bot · 2026-03-31T22:33:18Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kubermatic-bot · 2026-03-31T22:33:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign buraksekili for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kubermatic-bot · 2026-03-31T22:33:31Z

Hi @multi-io. Thanks for your PR.

I'm waiting for a kubermatic member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot

Pull request overview

Adjusts ProviderData.Update() / GetMachineUpdater() behavior so the caller-provided *Machine is only mutated when an actual update succeeds, avoiding surprising regressions caused by controller-runtime’s (potentially stale) cache on no-op updates.

Changes:

Fetches into a separate cachedMachine instead of overwriting the caller’s machine early.
Applies modifiers to cachedMachine, performs Update() only when there’s a real diff, and copies back into machine only after a successful update.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-31T23:55:20Z

pkg/cloudprovider/types/types.go

 			// Check if we actually change something and only update if that is the case.
-			unmodifiedMachine := machine.DeepCopy()
+			unmodifiedMachine := cachedMachine.DeepCopy()
 			for _, modify := range modifiers {
-				modify(machine)
+				modify(cachedMachine)
 			}
-			if equality.Semantic.DeepEqual(unmodifiedMachine, machine) {
+			if equality.Semantic.DeepEqual(unmodifiedMachine, cachedMachine) {


This change also alters modifier execution semantics: modifiers now run against cachedMachine, while the caller-provided machine remains stale until after a successful Update. Any modifier that closes over and reads machine (instead of only using the updatedMachine argument) will now make decisions on stale data and may introduce unintended diffs (e.g. pkg/cloudprovider/provider/azure/provider.go:769 checks HasFinalizer(machine, ...) inside the modifier). Consider preserving the old “modifiers see the freshly-Get()'d object” behavior by copying cachedMachine into machine before invoking modifiers and restoring the original machine value on no-op / error, or otherwise updating affected call sites and documenting that modifiers must only use their argument.

Copilot · 2026-03-31T23:55:21Z

pkg/cloudprovider/types/types.go

@@ -99,20 +99,27 @@ func GetMachineUpdater(ctx context.Context, client ctrlruntimeclient.Client) Mac
 		// Store name here, because the machine can be nil if an update failed.


The comment says the passed-in machine can be nil, but this function dereferences machine immediately (Namespace/Name) and later calls DeepCopyInto(machine), which would panic if it were nil. Either adjust the comment to reflect the actual contract (non-nil required) or add an explicit nil check that returns a clear error.

Suggested change

// Store name here, because the machine can be nil if an update failed.

if machine == nil {

return fmt.Errorf("machine must not be nil when applying modifiers")

}

// Store name here for use in the retry loop; machine is required to be non-nil.

Copilot · 2026-03-31T23:55:21Z

pkg/cloudprovider/types/types.go

+			if equality.Semantic.DeepEqual(unmodifiedMachine, cachedMachine) {
 				return nil
 			}


This is a subtle behavior change (caller object is no longer refreshed on no-op, and only updated on successful writes). Please add targeted unit tests for GetMachineUpdater to lock in the new semantics (no mutation on no-op / on update error; mutation on successful update) and to prevent regressions in conflict-retry behavior.

kubermatic-bot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 31, 2026

kubermatic-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 31, 2026

kubermatic-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 31, 2026

kron4eg requested a review from Copilot March 31, 2026 23:51

Copilot started reviewing on behalf of kron4eg March 31, 2026 23:51 View session

Copilot AI reviewed Mar 31, 2026

View reviewed changes

kubermatic-bot added docs/none Denotes a PR that doesn't need documentation (changes). and removed do-not-merge/docs-needed Indicates that a PR should not merge because it's missing one of the documentation labels. labels Apr 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix ProviderData.Update() semantics wrt controller-runtime client cache#2008

fix ProviderData.Update() semantics wrt controller-runtime client cache#2008
multi-io wants to merge 1 commit intokubermatic:mainfrom
multi-io:providerdata-update-semantics

multi-io commented Mar 31, 2026 •

edited

Loading

Uh oh!

kubermatic-bot commented Mar 31, 2026

Uh oh!

kubermatic-bot commented Mar 31, 2026

Uh oh!

kubermatic-bot commented Mar 31, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -99,20 +99,27 @@ func GetMachineUpdater(ctx context.Context, client ctrlruntimeclient.Client) Mac
		// Store name here, because the machine can be nil if an update failed.

-		// Store name here, because the machine can be nil if an update failed.
+		if machine == nil {
+			return fmt.Errorf("machine must not be nil when applying modifiers")
+		}
+		// Store name here for use in the retry loop; machine is required to be non-nil.

Conversation

multi-io commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kubermatic-bot commented Mar 31, 2026

Uh oh!

kubermatic-bot commented Mar 31, 2026

Uh oh!

kubermatic-bot commented Mar 31, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

multi-io commented Mar 31, 2026 •

edited

Loading