Handle errors while scaling kubernetes cluster by harikrishna-patnala · Pull Request #8107 · apache/cloudstack

harikrishna-patnala · 2023-10-17T05:21:34Z

Description

This PR fixes the issue #7920

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)
build/CI

Bug Severity

How Has This Been Tested?

Issue #7920 happened because of the unhandled CloudRuntime Exception, I've replicated the same scenario even when Global setting enable.dynamic.scale.vm is set to false.

Tested the same before and after fix

Before fix, state stuck in "scaling"

After fix, state first changed to "Alert" and then moved to "Running"

Also tested multiple other operations on the K8s cluster

Scale up the cluster with node count
Scale down the cluster with node count
Combinations of node count and service offering change

Those are all worked fine.

harikrishna-patnala · 2023-10-17T05:51:39Z

@blueorangutan package

blueorangutan · 2023-10-17T05:52:03Z

@harikrishna-patnala a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2023-10-17T06:56:00Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7393

harikrishna-patnala · 2023-10-17T12:48:38Z

@blueorangutan test rocky8 vmware-67u3

blueorangutan · 2023-10-17T12:50:03Z

@harikrishna-patnala a [SF] Trillian-Jenkins test job (rocky8 mgmt + vmware-67u3) has been kicked to run smoke tests

blueorangutan · 2023-10-18T03:18:34Z

[SF] Trillian test result (tid-7993)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 50614 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8107-t7993-vmware-67u3.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_deploy_vm_on_specific_host	`Error`	3603.69	test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster	`Error`	4.38	test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod	`Error`	4.40	test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster	`Error`	4.46	test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod	`Error`	12.56	test_vm_deployment_planner.py

yadvr

LGTM - needs testing.

shwstppr

Code LGTM but needs testing. In my original issue, dynamic scaling config was set to true.
It was the hypervisor plugin that was returning error during scaling. Using same vCPU and 2GB to 4GB scaling should reproduce the original issue on VMware.
cc @harikrishna-patnala @kiranchavala

DaanHoogland

code looks good but I wonder whether this will improve the situation, because of the now omitted exception. needs testing

kiranchavala

@harikrishna-patnala found the following issue with scaling of k8s cluster when these steps are followed

Set the global setting "enable.dynamic.scale.vm" is set to false
Create a k8s cluster with 1 worker node
Scale the k8s cluster to 2 worker node with the same offering
The scaling of the k8s cluster is a success
Create a new compute offering
Scale the k8s cluster to the new compute offering and make the worker size to 1 (downscale it)

There is an internal server error

2023-10-20 07:17:55,354 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-35:ctx-dc622ce0 job-57) (logid:1fe1d102) Unexpected exception while executing org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd
java.lang.NullPointerException
        at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleKubernetesClusterOffering(KubernetesClusterScaleWorker.java:285)
        at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleCluster(KubernetesClusterScaleWorker.java:460)
        at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.scaleKubernetesCluster(KubernetesClusterManagerImpl.java:1328)
        at org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd.execute(ScaleKubernetesClusterCmd.java:156)
        at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:163)
        at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:112)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829

The k8s cluster is struck in scaling state .

Ideally we should also not allow scaling to the k8s cluster to the same compute offering as the "enable.dynamic.scale.vm" is set to false"

also, we should not allow down scaling of the cluster

harikrishna-patnala · 2023-10-26T05:46:47Z

Code LGTM but needs testing. In my original issue, dynamic scaling config was set to true. It was the hypervisor plugin that was returning error during scaling. Using same vCPU and 2GB to 4GB scaling should reproduce the original issue on VMware. cc @harikrishna-patnala @kiranchavala

In both cases, CloudRuntimeException is thrown @shwstppr , so this covers that error as well.

harikrishna-patnala · 2023-10-26T05:53:56Z

KubernetesClusterScaleWorker.java:460

Thanks for testing @kiranchavala .
For sure this error or this case is not part of this PR but worth checking the case and fix it. I'll test it again with your reproduction steps.

shwstppr · 2023-11-08T09:09:19Z

@harikrishna-patnala will you be able to make changes for cases reported by @kiranchavala ?

DaanHoogland · 2023-11-08T13:46:39Z

@harikrishna-patnala will you be able to make changes for cases reported by @kiranchavala ?

and if you can't, is it worth merging without?/will you create a new issue for it?

harikrishna-patnala · 2023-11-10T07:07:35Z

checking it now, I'll see if that case can be covered here.

DaanHoogland · 2023-11-15T14:24:53Z

@harikrishna-patnala any progress/prognosis?

shwstppr · 2023-12-05T10:55:30Z

@harikrishna-patnala any update on this?

harikrishna-patnala · 2023-12-06T08:59:13Z

Sorry @DaanHoogland and @shwstppr could not finish this completely. Resuming it now.

harikrishna-patnala · 2023-12-07T06:05:57Z

I'm seeing some other cases where the cluster is stuck in "Scaling" state. Trying to fix them as well.

harikrishna-patnala · 2023-12-07T07:54:25Z

                @Override
                public KubernetesClusterVO doInTransaction(TransactionStatus status) {
-                KubernetesClusterVO updatedCluster = kubernetesClusterDao.createForUpdate(kubernetesCluster.getId());
+                KubernetesClusterVO updatedCluster = kubernetesClusterDao.findById(kubernetesCluster.getId());


This is a bigger change because, I observed createForUpdate() is returning an object with all null or default values and assumption is that we have to update that entry for all the columns.

This is causing few issues with scaling operations

When I tried to change the node count and change the compute offering at the same time. Compute offering change on few nodes is missing. This happened because of above code

Above issue is causing after effects where I could not change the compute offering of the cluster anymore as there are differences in the compute offerings of the nodes

The state issue, causing NPE (the actual bug raised here)

harikrishna-patnala · 2023-12-07T07:59:26Z

This is ready for review cc @kiranchavala @shwstppr @DaanHoogland

harikrishna-patnala · 2023-12-07T07:59:55Z

@blueorangutan package

blueorangutan · 2023-12-07T08:02:03Z

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov · 2023-12-07T08:56:15Z

Codecov Report

Attention: 900 lines in your changes are missing coverage. Please review.

Comparison is base (29c7b31) 13.02% compared to head (ccf9d1e) 13.11%.
Report is 115 commits behind head on 4.18.

Files	Patch %	Lines
...n/java/com/cloud/network/IpAddressManagerImpl.java	11.01%	105 Missing ⚠️
...ava/com/cloud/upgrade/dao/Upgrade41800to41810.java	1.05%	94 Missing ⚠️
...java/com/cloud/agent/manager/AgentManagerImpl.java	0.00%	66 Missing ⚠️
...src/main/java/com/cloud/upgrade/GuestOsMapper.java	20.73%	65 Missing ⚠️
...ud/hypervisor/kvm/storage/KVMStorageProcessor.java	0.00%	47 Missing and 1 partial ⚠️
...ain/java/com/cloud/api/query/QueryManagerImpl.java	0.00%	35 Missing ⚠️
...ain/java/com/cloud/storage/dao/GuestOSDaoImpl.java	0.00%	28 Missing ⚠️
...in/java/com/cloud/server/ManagementServerImpl.java	0.00%	24 Missing ⚠️
.../apache/cloudstack/vm/UnmanagedVMsManagerImpl.java	48.93%	19 Missing and 5 partials ⚠️
...va/org/apache/cloudstack/ldap/LdapManagerImpl.java	0.00%	20 Missing ⚠️
... and 59 more

Additional details and impacted files

@@             Coverage Diff              @@
##               4.18    #8107      +/-   ##
============================================
+ Coverage     13.02%   13.11%   +0.09%     
- Complexity     9032     9134     +102     
============================================
  Files          2720     2720              
  Lines        257080   257674     +594     
  Branches      40088    40174      +86     
============================================
+ Hits          33476    33803     +327     
- Misses       219400   219579     +179     
- Partials       4204     4292      +88

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

blueorangutan · 2023-12-07T09:04:40Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7962

DaanHoogland · 2023-12-07T09:16:57Z

@blueorangutan test

blueorangutan · 2023-12-07T09:18:04Z

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

shwstppr

Added some comments. Will need testing

blueorangutan · 2023-12-07T21:03:01Z

[SF] Trillian test result (tid-8499)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 40740 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8107-t8499-kvm-centos7.zip
Smoke tests completed. 109 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File

shwstppr · 2023-12-08T09:58:45Z

@blueorangutan package

harikrishna-patnala · 2023-12-08T10:10:45Z

@blueorangutan package

blueorangutan · 2023-12-08T10:12:03Z

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2023-12-08T11:12:17Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7986

harikrishna-patnala · 2023-12-08T22:44:23Z

@blueorangutan test

blueorangutan · 2023-12-08T22:46:03Z

@harikrishna-patnala a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan · 2023-12-09T10:33:54Z

[SF] Trillian test result (tid-8534)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 40913 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8107-t8534-kvm-centos7.zip
Smoke tests completed. 108 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_08_migrate_vm	`Error`	45.83	test_vm_life_cycle.py

harikrishna-patnala · 2023-12-11T04:18:58Z

@kiranchavala would you like to continue testing this and also please verify the issue that you've raised

kiranchavala

LGTM, tested the fix by @harikrishna-patnala and it is working fine

shwstppr

Code LGTM

This PR fixes the issue apache#7920

boring-cyborg Bot added the component:kubernetes label Oct 17, 2023

shwstppr added this to the 4.18.2.0 milestone Oct 17, 2023

yadvr approved these changes Oct 18, 2023

View reviewed changes

DaanHoogland added the status:needs-testing label Oct 18, 2023

kiranchavala self-assigned this Oct 19, 2023

yadvr requested review from DaanHoogland and shwstppr October 19, 2023 11:26

shwstppr reviewed Oct 19, 2023

View reviewed changes

DaanHoogland reviewed Oct 19, 2023

View reviewed changes

Comment thread ...e/src/main/java/com/cloud/kubernetes/cluster/actionworkers/KubernetesClusterScaleWorker.java Outdated

kiranchavala reviewed Oct 20, 2023

View reviewed changes

DaanHoogland assigned harikrishna-patnala Nov 15, 2023

harikrishna-patnala added 2 commits December 7, 2023 13:06

Handle errors while scaling kubernetes cluster

623b018

Fixed state issues and also applying new compute offering to all nodes

03d1295

harikrishna-patnala force-pushed the HandleScaleK8sError branch from bd6030e to 03d1295 Compare December 7, 2023 07:37

harikrishna-patnala commented Dec 7, 2023

View reviewed changes

Format code

6a419bc

shwstppr reviewed Dec 7, 2023

View reviewed changes

Comment thread ...om/cloud/kubernetes/cluster/actionworkers/KubernetesClusterResourceModifierActionWorker.java Outdated

Comment thread ...e/src/main/java/com/cloud/kubernetes/cluster/actionworkers/KubernetesClusterScaleWorker.java Outdated

revert manual state changes

ccf9d1e

kiranchavala approved these changes Dec 12, 2023

View reviewed changes

shwstppr approved these changes Dec 12, 2023

View reviewed changes

shwstppr merged commit 3ce7c39 into apache:4.18 Dec 12, 2023

shwstppr deleted the HandleScaleK8sError branch December 12, 2023 11:27

shwstppr mentioned this pull request Dec 12, 2023

cks: k8s cluster continue to show in scaling state even after failure #7920

Closed

dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Dec 15, 2023

cks: handle errors while scaling cluster (apache#8107)

37ef0dc

This PR fixes the issue apache#7920

ugenk mentioned this pull request Apr 22, 2024

problem on adding new node to kubernetes #8943

Closed

DaanHoogland removed the status:needs-testing label Jan 28, 2026

Conversation

harikrishna-patnala commented Oct 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Types of changes

Bug Severity

How Has This Been Tested?

Uh oh!

harikrishna-patnala commented Oct 17, 2023

Uh oh!

blueorangutan commented Oct 17, 2023

Uh oh!

blueorangutan commented Oct 17, 2023

Uh oh!

harikrishna-patnala commented Oct 17, 2023

Uh oh!

blueorangutan commented Oct 17, 2023

Uh oh!

blueorangutan commented Oct 18, 2023

Uh oh!

yadvr left a comment

Choose a reason for hiding this comment

Uh oh!

shwstppr left a comment

Choose a reason for hiding this comment

Uh oh!

DaanHoogland left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kiranchavala left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harikrishna-patnala commented Oct 26, 2023

Uh oh!

harikrishna-patnala commented Oct 26, 2023

Uh oh!

shwstppr commented Nov 8, 2023

Uh oh!

DaanHoogland commented Nov 8, 2023

Uh oh!

harikrishna-patnala commented Nov 10, 2023

Uh oh!

DaanHoogland commented Nov 15, 2023

Uh oh!

shwstppr commented Dec 5, 2023

Uh oh!

harikrishna-patnala commented Dec 6, 2023

Uh oh!

harikrishna-patnala commented Dec 7, 2023

Uh oh!

harikrishna-patnala Dec 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harikrishna-patnala commented Dec 7, 2023

Uh oh!

harikrishna-patnala commented Dec 7, 2023

Uh oh!

blueorangutan commented Dec 7, 2023

Uh oh!

codecov Bot commented Dec 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

blueorangutan commented Dec 7, 2023

Uh oh!

DaanHoogland commented Dec 7, 2023

Uh oh!

blueorangutan commented Dec 7, 2023

Uh oh!

shwstppr left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

blueorangutan commented Dec 7, 2023

Uh oh!

shwstppr commented Dec 8, 2023

harikrishna-patnala commented Oct 17, 2023 •

edited

Loading

kiranchavala left a comment •

edited

Loading

harikrishna-patnala Dec 7, 2023 •

edited

Loading

codecov Bot commented Dec 7, 2023 •

edited

Loading