KEP 5836: Add KEP for Scheduler Preemption for In-Place Pod Resize (alpha) #5932

natasha41575 wants to merge 1 commit into kubernetes:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: natasha41575. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.

/sig scheduling
@natasha41575: The following tests failed.

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
> 4. **Snapshot Adjustment**: Temporarily remove the `Deferred` pod from the node snapshot to calculate required space accurately.
> 5. **Calculate Victims**: Identify suitable preemption victims and then restore the pod to the snapshot.
> 6. **Update Status**: Report the success or failure of the preemption attempt in the pod status. If preemption is insufficient
Reporting success is probably not the end of the story, as we should describe how the resource accounting is done until the pod gets indeed resized on kubelet. Scheduler needs to assume somehow (see also scheduler assumption process in binding) that kubelet will accept the resize and keep blocking newly requested resources in memory.
The scheduler will automatically be blocked from using the new requested resources, because the scheduler uses max(spec, allocated, actual) when determining fit. Since a resize request is made by adjusting the spec resources, these resources are already considered "reserved" from scheduler perspective. I have a note about this under the Kubelet behavior section below. I can add one here too.
We probably can take max, but only after scheduler accepts the resize (in the mentioned assume process). But at the time the scheduler notices the Deferred state, it will create some pod-alloc-to-schedule, but can't reserve resources until it initiates preemption (set nomination) and later accept it (assume resources). The assumption can be dropped once it receives a notification about the actual resize.
Another complication is that, unlike the initial pod scheduling, I suspect the requested resources may change during this process. It's not obvious how the scheduler should handle such a situation. Let's consider that there are effectively two resize requests, A and B. The scheduler could initiate preemption for A and set a nomination (blocking resources until preemption finishes) or even assume (wait for the resize on the Kubelet side). When the following update B comes, we probably can't just update the cache, but will need to repeat the scheduling process without dropping the reservation held for A.
@dom4ha - are you sure this is really true and this is how scheduler works?
Looking at the code, it seems that for "deferred" resize requests, the resources that scheduler assumes are indeed max(spec, allocated, actual) as Natasha wrote above:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-helpers/resource/helpers.go#L294-L299
So if we have deferred upsize, scheduler is actually already using those resources as requested (the pod is obviously already assigned to node).
That BTW means that in such a case the total requested resources in NodeInfo may actually exceed the allocatable resources:
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/types.go#L427
For me it's debatable whether it really does what it is supposed to do. But if that works as intended, then Natasha seems to be right that we actually don't need to do anything...
@macsko - FYI and for your thoughts too
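To make the max(spec, allocated, actual) accounting concrete, here is a minimal Go sketch of the idea discussed above; the function names and raw millicore arithmetic are illustrative only, not the actual `component-helpers` signatures (the real helpers operate on `resource.Quantity`):

```go
package main

import "fmt"

// maxQuantity returns the larger of two resource amounts
// (plain millivalues here for simplicity).
func maxQuantity(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// effectiveRequest illustrates how the scheduler sizes a pod with a pending
// (possibly Deferred) resize: it takes the max of the desired spec request,
// the allocated request, and the actually-running request, so a deferred
// upsize is already counted against the node.
func effectiveRequest(spec, allocated, actual int64) int64 {
	return maxQuantity(spec, maxQuantity(allocated, actual))
}

func main() {
	// Deferred upsize: spec asks for 2000m CPU, kubelet still allocates 1000m.
	fmt.Println(effectiveRequest(2000, 1000, 1000))
	// Downsize in progress: actual usage still at the old, larger value.
	fmt.Println(effectiveRequest(500, 500, 1000))
}
```

The point of the sketch is only that the spec term dominates for a deferred upsize, so the new request is "reserved" from the scheduler's perspective the moment the spec changes.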
> the total request resources in NodeInfo may actually exceed the allocatable resources
Yes, this is intended. Deferred resizes are supposed to (and do) block the capacity, which may look like the Node is "overcommitted" if you just look at the NodeInfo.
We actually discussed it in this issue, which we closed as WAI: kubernetes/kubernetes#135107 (comment). I think it is necessary that we keep doing this to prevent additional race conditions between kubelet/scheduler (and other components).
This means that the behavior today is that resizes are prioritized over scheduling new pods, so I don't think we need to change anything to make the scheduler reserve the resources.
I see. I thought that we could change the behavior to not block scheduling of other things, to exactly avoid kubernetes/kubernetes#135107 (comment). So assuming that blocking resources (in the "deferred" state) is desired, then indeed there is nothing else that we'd need to do here.
There is still a question of how the scheduler could protect workloads for which it found a suitable placement (as part of the Workload Aware Scheduling process). Since kubelet is the SoT, the scheduler needs to proactively attempt to reserve the necessary resources on the kubelet side, not the other way around. This is exactly how we attempted to address this problem in [1], so we seem to be aligned with WAS as well.
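As a toy illustration of the accounting point above: with a deferred upsize counted at its spec value, the summed requests on a node can exceed allocatable, and a fit check for any new pod then fails. All names and numbers below are hypothetical, not the real `NodeInfo` fields:

```go
package main

import "fmt"

// nodeRequested sums per-pod effective requests (toy millicore values).
// With a deferred upsize counted at its spec value, the sum can exceed
// the node's allocatable capacity; that is intended, and it is what keeps
// new pods from stealing the capacity the resize is waiting for.
func nodeRequested(podRequests []int64) int64 {
	var total int64
	for _, r := range podRequests {
		total += r
	}
	return total
}

// fits reports whether a new pod's request fits in the remaining capacity.
func fits(allocatable, requested, newPod int64) bool {
	return requested+newPod <= allocatable
}

func main() {
	allocatable := int64(4000)
	// Two pods at 1500m each, plus one pod with a deferred upsize to 2000m.
	requested := nodeRequested([]int64{1500, 1500, 2000})
	fmt.Println(requested > allocatable)           // node looks "overcommitted"
	fmt.Println(fits(allocatable, requested, 500)) // new pod is rejected
}
```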
> ### ResizeUnschedulable Pod Condition
>
> The Scheduler will own a new `ResizeUnschedulable` condition type in the pod status. This condition will be present only after
IIUC scheduler should keep trying to preempt after a preemption attempt failure similarly to how it keeps trying to schedule a pod which does not fit?
What should happen after preemption is triggered? Should a pod wait in the unschedulable queue until the preemption is finished? Note that in the case of real pod scheduling, the pod is still considered unschedulable (here: unresizable) until all victim pods have been removed from the node. It has nominatedNodeName set to indicate that it's intended to use the resources the victims are about to free up. Once the resources are freed up and have not been taken by any higher-priority pod in the meantime, the nomination turns into an assignment and the unschedulable (here: unresizable) condition is cleared.
I suspect we want the resize to behave exactly like pod scheduling, which means for instance setting nominatedNodeName to indicate that this pod is going to use the newly requested resources once the victims disappear. Without nomination, it would be impossible to distinguish whether the pod is unresizable because there are no viable victims or because it is waiting for preemption to finish.
Note that it's important to keep the pod-under-resize until the preemption is finished, because it may always happen that the place reserved for the resize (via nomination) is taken in the meantime by a higher-priority pod, so the scheduler may need to identify other victims or clear the nomination to indicate that the resize is indeed not feasible.
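A toy Go sketch of the disambiguation argument above (the `resizeState` type and its fields are hypothetical, not proposed API): without a nomination, "no viable victims" and "waiting for victims to terminate" look identical.

```go
package main

import "fmt"

// resizeState is a toy view of a pod with a Deferred resize.
type resizeState struct {
	nominatedNodeName string // set when preemption was triggered for the resize
	victimsRemaining  int    // victims still terminating on that node
}

// describe shows why a nomination is needed to tell apart the two
// "unresizable" situations discussed above.
func describe(s resizeState) string {
	switch {
	case s.nominatedNodeName == "":
		return "unresizable: no viable victims found"
	case s.victimsRemaining > 0:
		return "waiting: preemption in progress, resources nominated"
	default:
		return "resizable: victims gone, nomination can become an allocation"
	}
}

func main() {
	fmt.Println(describe(resizeState{}))
	fmt.Println(describe(resizeState{nominatedNodeName: "node-a", victimsRemaining: 2}))
	fmt.Println(describe(resizeState{nominatedNodeName: "node-a"}))
}
```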
> IIUC scheduler should keep trying to preempt after a preemption attempt failure similarly to how it keeps trying to schedule a pod which does not fit?
Yes, I think so. But I think it should try to be smart enough to only reattempt if something on the node changes that could cause preemption to succeed. I'll think about this and try to enumerate such conditions in the KEP.
For the rest of your comment, it seems related to the discussion above (#5932 (comment)).
Is the only reason to keep the pods in the scheduling queue until the preemption is finished to reserve the resources for the resize? If so, let's continue the discussion above, because I still think we don't need to do anything more (unless I am missing something else). The resources are already reserved, because (1) the pod is already bound to the node and (2) the scheduler is already today assuming resources as max(spec, allocated, actual):
where spec in this case is the deferred upscaled resources.
> Yes, I think so. But I think it should try to be smart enough to only reattempt if something on the node changes that could cause preemption to succeed. I'll think about this and try to enumerate such conditions in the KEP.
This is exactly how initial scheduling works for unschedulable pods as well, so there should be no need for any logic dedicated to resizing.
I think there will be time to discuss details, but just to highlight: we should also take into consideration cases like a higher-priority pod needing to use resources reserved by the pod-during-resize. We probably don't want to kill the original pod if it's not necessary. The scheduler also needs to keep track of which pod is preempting what, so there are a few details we will have to look at more closely.
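For illustration, the "reattempt only on relevant node changes" idea maps naturally onto the scheduler's event-driven requeueing of unschedulable pods; the event and type names below are made up for the sketch and are not the real framework types:

```go
package main

import "fmt"

// clusterEvent is a toy stand-in for the scheduler's cluster events.
type clusterEvent struct {
	kind string // e.g. "PodDeleted", "NodeAllocatableIncreased"
	node string
}

// shouldRetryDeferredResize returns true only for events that could free
// capacity on the deferred pod's own node, mirroring how unschedulable
// pods are requeued only on potentially relevant events.
func shouldRetryDeferredResize(ev clusterEvent, podNode string) bool {
	if ev.node != podNode {
		return false // only this node matters for an in-place resize
	}
	switch ev.kind {
	case "PodDeleted", "NodeAllocatableIncreased":
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldRetryDeferredResize(clusterEvent{"PodDeleted", "node-a"}, "node-a"))
	fmt.Println(shouldRetryDeferredResize(clusterEvent{"PodDeleted", "node-b"}, "node-a"))
}
```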
> #### Story 3: Driving cluster autoscaling
>
> Preempted pods managed by workload controllers are likely to make their way back to the scheduling queue. By moving the
This is inaccurate. Once preempted, pods reach a terminal state; pods are never rescheduled.
New pods may be created in response by the appropriate controllers, and those will appear in the scheduling queue.
> When processing `Deferred` resizes, the Scheduler will skip most of the standard scheduling cycle,
> as the primary goal is simply to trigger preemption. Instead of searching for a suitable node, the Scheduler
> will treat these resizes as a `FitError`, automatically moving them to the preemption phase. Within
> the preemption plugin, the search for victims must be restricted to the same node as the `Deferred` pod. Furthermore, the
With Workload-Aware Preemption, we may also need to preempt pods on other nodes.
So let's maybe clarify that the space needs to be freed up on the node of our deferred pod, but this may also result in preemption of pods on other nodes.
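A minimal sketch of the clarified rule, using a hypothetical helper: for an in-place resize, the candidate node set for freeing space collapses to the pod's own node, even though workload-aware preemption may additionally evict pods elsewhere.

```go
package main

import "fmt"

// candidateNodesForResize narrows the preemption search space: for an
// in-place resize, the space must be freed on the pod's own node, so that
// node is the only candidate (workload-aware preemption may then
// additionally evict related pods on other nodes).
func candidateNodesForResize(allNodes []string, podNode string) []string {
	for _, n := range allNodes {
		if n == podNode {
			return []string{n}
		}
	}
	return nil // pod's node not in the snapshot; nothing to preempt for
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	fmt.Println(candidateNodesForResize(nodes, "node-b"))
}
```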
> 1. **Identify Deferred Status**: Confirm the pod has a `Deferred` resize and is already bound to a node.
> 2. **Trigger Preemption**: Treat the request as a `FitError` to initiate existing Scheduler preemption logic.
> 3. **Isolate Node**: Within the Preemption plugin, narrow the victim search exclusively to the pod's node.
Please see my comment above about Workload-Aware-Preemption.
> 4. **Snapshot Adjustment**: Temporarily remove the `Deferred` pod from the node snapshot to calculate required space accurately.
> 5. **Calculate Victims**: Identify suitable preemption victims and then restore the pod to the snapshot.
> 6. **Update Status**: Report the success or failure of the preemption attempt in the pod status. If preemption is insufficient
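Steps 4–5 of the quoted flow could be sketched roughly as follows; the snapshot type and helper names are hypothetical, and the toy snapshot keeps chosen victims removed (as they would be preempted) while restoring the resizing pod itself:

```go
package main

import "fmt"

// snapshot is a toy node snapshot tracking requested millicores per pod.
type snapshot struct {
	allocatable int64
	requests    map[string]int64
}

func (s *snapshot) free() int64 {
	var used int64
	for _, r := range s.requests {
		used += r
	}
	return s.allocatable - used
}

// victimsForResize temporarily removes the resizing pod from the snapshot
// (step 4), greedily picks victims until the pod's new request fits
// (step 5), and restores the pod afterwards.
func victimsForResize(s *snapshot, pod string, newRequest int64, candidates []string) []string {
	old := s.requests[pod]
	delete(s.requests, pod)                  // step 4: snapshot adjustment
	defer func() { s.requests[pod] = old }() // restore the pod to the snapshot

	var victims []string
	for _, v := range candidates { // step 5: calculate victims
		if s.free() >= newRequest {
			break
		}
		victims = append(victims, v)
		delete(s.requests, v)
	}
	if s.free() < newRequest {
		return nil // preemption insufficient
	}
	return victims
}

func main() {
	s := &snapshot{allocatable: 4000, requests: map[string]int64{"p": 1000, "v1": 1500, "v2": 1500}}
	fmt.Println(victimsForResize(s, "p", 2500, []string{"v1", "v2"}))
}
```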
> Consider including folks who also work outside the SIG or subproject.
> -->
>
> No risks have been identified.
I definitely think there are risks here. Maybe you already implicitly assume some mitigations or have explanations for why these are acceptable, but at least the two that I have in mind are:

- Performance: if we have a significant number of deferred pods, periodic reconsideration of those can visibly affect scheduling throughput. Presumably the mitigation is triggering reconsideration only if something on the node changed.
- Interaction with workload-aware preemption: the resize may trigger preemption of a potentially pretty large workload. Arguably it's probably OK given it's lower priority, but it's worth mentioning here that the effect may be visibly larger than on the node itself.
> participating-sigs:
>   - sig-node
>   - sig-autoscaling
> status: implementable
KEP 5836: Scheduler Preemption for In-Place Pod Resize (alpha)
I have a PoC for the implementation here: kubernetes/kubernetes#137206
Issue link: #5836
Targeting 1.37 for alpha