Handle errors while scaling kubernetes cluster #8107
@blueorangutan package |
|
@harikrishna-patnala a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7393 |
|
@blueorangutan test rocky8 vmware-67u3 |
|
@harikrishna-patnala a [SF] Trillian-Jenkins test job (rocky8 mgmt + vmware-67u3) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-7993)
|
shwstppr
left a comment
Code LGTM but needs testing. In my original issue, the dynamic scaling config was set to true.
It was the hypervisor plugin that was returning an error during scaling. Scaling with the same vCPU count and memory going from 2GB to 4GB should reproduce the original issue on VMware.
cc @harikrishna-patnala @kiranchavala
DaanHoogland
left a comment
Code looks good, but I wonder whether this will improve the situation, because of the now-omitted exception. Needs testing.
@harikrishna-patnala found the following issue with scaling of the k8s cluster when these steps are followed:
- Set the global setting "enable.dynamic.scale.vm" to false
- Create a k8s cluster with 1 worker node
- Scale the k8s cluster to 2 worker nodes with the same offering
- The scaling of the k8s cluster succeeds
- Create a new compute offering
- Scale the k8s cluster to the new compute offering and set the worker size to 1 (downscale it)
There is an internal server error:
2023-10-20 07:17:55,354 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-35:ctx-dc622ce0 job-57) (logid:1fe1d102) Unexpected exception while executing org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd
java.lang.NullPointerException
at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleKubernetesClusterOffering(KubernetesClusterScaleWorker.java:285)
at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleCluster(KubernetesClusterScaleWorker.java:460)
at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.scaleKubernetesCluster(KubernetesClusterManagerImpl.java:1328)
at org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd.execute(ScaleKubernetesClusterCmd.java:156)
at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:163)
at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:112)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620)
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
- The k8s cluster is stuck in the "Scaling" state.
Ideally, we should also not allow scaling the k8s cluster to the same compute offering, since "enable.dynamic.scale.vm" is set to false.
Also, we should not allow downscaling of the cluster.
In both cases, CloudRuntimeException is thrown @shwstppr , so this covers that error as well. |
Thanks for testing @kiranchavala . |
|
@harikrishna-patnala will you be able to make changes for the cases reported by @kiranchavala ? |
And if you can't, is it worth merging without them, or will you create a new issue for them? |
|
checking it now, I'll see if that case can be covered here. |
|
@harikrishna-patnala any progress/prognosis? |
|
@harikrishna-patnala any update on this? |
|
Sorry @DaanHoogland and @shwstppr, I could not finish this completely. Resuming it now. |
|
I'm seeing some other cases where the cluster is stuck in the "Scaling" state. Trying to fix them as well.
bd6030e to 03d1295
  @Override
  public KubernetesClusterVO doInTransaction(TransactionStatus status) {
-     KubernetesClusterVO updatedCluster = kubernetesClusterDao.createForUpdate(kubernetesCluster.getId());
+     KubernetesClusterVO updatedCluster = kubernetesClusterDao.findById(kubernetesCluster.getId());
This is a bigger change because I observed that createForUpdate() returns an object with all null or default values, and the assumption is that we have to set every column on that entry when updating it.
This causes a few issues with scaling operations:
- When I tried to change the node count and the compute offering at the same time, the compute offering change was missing on a few nodes. This happened because of the above code.
- The above issue has after-effects: I could no longer change the compute offering of the cluster, because the nodes' compute offerings differ.
- The state issue, causing the NPE (the actual bug raised here)
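To illustrate the difference described above, here is a minimal sketch using a hypothetical in-memory DAO. `ClusterRecord` and `InMemoryClusterDao` are invented for illustration and are not CloudStack classes. The sketch only models the behaviour reported in this comment (an "update shell" whose fields are unset versus a fully loaded record). It does not describe how the real `createForUpdate()` is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a cluster row; NOT the real KubernetesClusterVO.
final class ClusterRecord {
    final long id;
    String state;             // null means "not set on this object"
    Long serviceOfferingId;
    Long nodeCount;

    ClusterRecord(long id) {
        this.id = id;
    }

    ClusterRecord copy() {
        ClusterRecord c = new ClusterRecord(id);
        c.state = state;
        c.serviceOfferingId = serviceOfferingId;
        c.nodeCount = nodeCount;
        return c;
    }
}

// Hypothetical DAO that mimics the two lookup styles discussed above.
final class InMemoryClusterDao {
    private final Map<Long, ClusterRecord> rows = new HashMap<>();

    void insert(ClusterRecord record) {
        rows.put(record.id, record.copy());
    }

    // Modelled on the observation above: only the id is populated.
    ClusterRecord createForUpdate(long id) {
        return new ClusterRecord(id);
    }

    // Returns a copy of the stored row with every field populated.
    ClusterRecord findById(long id) {
        ClusterRecord stored = rows.get(id);
        return stored == null ? null : stored.copy();
    }
}

final class CreateForUpdateDemo {
    public static void main(String[] args) {
        InMemoryClusterDao dao = new InMemoryClusterDao();
        ClusterRecord original = new ClusterRecord(1L);
        original.state = "Running";
        original.serviceOfferingId = 10L;
        original.nodeCount = 2L;
        dao.insert(original);

        // Reading from the shell yields nulls; dereferencing them would throw an NPE.
        ClusterRecord shell = dao.createForUpdate(1L);
        System.out.println("shell state: " + shell.state);

        // The loaded record keeps its existing values, so changing one field is safe.
        ClusterRecord loaded = dao.findById(1L);
        loaded.nodeCount = 1L;
        System.out.println("loaded state: " + loaded.state
                + ", offering: " + loaded.serviceOfferingId);
    }
}
```

In this sketch, code that later reads `state` or `serviceOfferingId` from the object returned by `createForUpdate()` sees null, which matches the symptoms listed above.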
|
This is ready for review cc @kiranchavala @shwstppr @DaanHoogland |
|
@blueorangutan package |
|
@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7962 |
|
@blueorangutan test |
|
@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
shwstppr
left a comment
Added some comments. Will need testing
|
[SF] Trillian test result (tid-8499)
|
|
@blueorangutan package |
|
@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7986 |
|
@blueorangutan test |
|
@harikrishna-patnala a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8534)
|
|
@kiranchavala would you like to continue testing this and also please verify the issue that you've raised |
kiranchavala
left a comment
LGTM, tested the fix by @harikrishna-patnala and it is working fine
This PR fixes the issue apache#7920

Description
This PR fixes the issue #7920
Types of changes
Bug Severity
How Has This Been Tested?
Issue #7920 happened because of an unhandled CloudRuntimeException. I replicated the same scenario even when the global setting enable.dynamic.scale.vm is set to false.
Tested the same before and after the fix:
Before the fix, the state was stuck in "Scaling".
After the fix, the state first changed to "Alert" and then moved to "Running".
Also tested multiple other operations on the K8s cluster.
All of them worked fine.
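The general idea behind the fix can be sketched as follows: catch the runtime exception raised by a scaling step and move the cluster to a failure state instead of leaving it in "Scaling". This is a simplified illustration under stated assumptions, not the code from this PR. The `State` enum, `Cluster` class and `scale` method are hypothetical names. The later move from "Alert" to "Running" mentioned above is not modelled here.

```java
// Hypothetical sketch; these names do not come from the CloudStack code base.
final class ScaleStateSketch {
    enum State { RUNNING, SCALING, ALERT }

    static final class Cluster {
        State state = State.RUNNING;
    }

    // Runs one scaling step and always leaves the cluster in a terminal state.
    // Returns true on success, false if the step threw a runtime exception.
    static boolean scale(Cluster cluster, Runnable scalingStep) {
        cluster.state = State.SCALING;
        try {
            scalingStep.run();
            cluster.state = State.RUNNING;
            return true;
        } catch (RuntimeException e) {
            // Without this handler the state would remain SCALING.
            cluster.state = State.ALERT;
            return false;
        }
    }

    public static void main(String[] args) {
        Cluster healthy = new Cluster();
        scale(healthy, () -> { });
        System.out.println("after successful step: " + healthy.state);

        Cluster failing = new Cluster();
        scale(failing, () -> {
            throw new IllegalStateException("simulated scaling failure");
        });
        System.out.println("after failed step: " + failing.state);
    }
}
```

Returning a boolean keeps the sketch short. A real implementation would more likely log the exception and report the failure to the caller.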