Skip to content

Handle errors while scaling kubernetes cluster#8107

Merged
shwstppr merged 4 commits intoapache:4.18from
shapeblue:HandleScaleK8sError
Dec 12, 2023
Merged

Handle errors while scaling kubernetes cluster#8107
shwstppr merged 4 commits intoapache:4.18from
shapeblue:HandleScaleK8sError

Conversation

@harikrishna-patnala
Copy link
Copy Markdown
Member

@harikrishna-patnala harikrishna-patnala commented Oct 17, 2023

Description

This PR fixes the issue #7920

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

Issue #7920 happened because of the unhandled CloudRuntime Exception, I've replicated the same scenario even when Global setting enable.dynamic.scale.vm is set to false.

Tested the same before and after fix

Before fix, state stuck in "scaling"

Screenshot 2023-10-17 at 10 25 54 AM

After fix, state first changed to "Alert" and then moved to "Running"

Screenshot 2023-10-17 at 10 45 51 AM

Also tested multiple other operations on the K8s cluster

  1. Scale up the cluster with node count
  2. Scale down the cluster with node count
  3. Combinations of node count and service offering change

Those are all worked fine.

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@harikrishna-patnala a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7393

@shwstppr shwstppr added this to the 4.18.2.0 milestone Oct 17, 2023
@harikrishna-patnala
Copy link
Copy Markdown
Member Author

@blueorangutan test rocky8 vmware-67u3

@blueorangutan
Copy link
Copy Markdown

@harikrishna-patnala a [SF] Trillian-Jenkins test job (rocky8 mgmt + vmware-67u3) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-7993)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 50614 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8107-t7993-vmware-67u3.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_deploy_vm_on_specific_host Error 3603.69 test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster Error 4.38 test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod Error 4.40 test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster Error 4.46 test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod Error 12.56 test_vm_deployment_planner.py

Copy link
Copy Markdown
Member

@yadvr yadvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - needs testing.

Copy link
Copy Markdown
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM but needs testing. In my original issue, dynamic scaling config was set to true.
It was the hypervisor plugin that was returning error during scaling. Using same vCPU and 2GB to 4GB scaling should reproduce the original issue on VMware.
cc @harikrishna-patnala @kiranchavala

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good but I wonder whether this will improve the situation, because of the now omitted exception. needs testing

Copy link
Copy Markdown
Member

@kiranchavala kiranchavala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harikrishna-patnala found the following issue with scaling of k8s cluster when these steps are followed

  1. Set the global setting "enable.dynamic.scale.vm" is set to false

  2. Create a k8s cluster with 1 worker node

  3. Scale the k8s cluster to 2 worker node with the same offering

  4. The scaling of the k8s cluster is a success

  5. Create a new compute offering

  6. Scale the k8s cluster to the new compute offering and make the worker size to 1 (downscale it)

There is an internal server error

2023-10-20 07:17:55,354 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-35:ctx-dc622ce0 job-57) (logid:1fe1d102) Unexpected exception while executing org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd
java.lang.NullPointerException
        at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleKubernetesClusterOffering(KubernetesClusterScaleWorker.java:285)
        at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterScaleWorker.scaleCluster(KubernetesClusterScaleWorker.java:460)
        at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.scaleKubernetesCluster(KubernetesClusterManagerImpl.java:1328)
        at org.apache.cloudstack.api.command.user.kubernetes.cluster.ScaleKubernetesClusterCmd.execute(ScaleKubernetesClusterCmd.java:156)
        at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:163)
        at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:112)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829
  1. The k8s cluster is struck in scaling state .

Ideally we should also not allow scaling to the k8s cluster to the same compute offering as the "enable.dynamic.scale.vm" is set to false"

also, we should not allow down scaling of the cluster

scaling

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

Code LGTM but needs testing. In my original issue, dynamic scaling config was set to true. It was the hypervisor plugin that was returning error during scaling. Using same vCPU and 2GB to 4GB scaling should reproduce the original issue on VMware. cc @harikrishna-patnala @kiranchavala

In both cases, CloudRuntimeException is thrown @shwstppr , so this covers that error as well.

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

KubernetesClusterScaleWorker.java:460

Thanks for testing @kiranchavala .
For sure this error or this case is not part of this PR but worth checking the case and fix it. I'll test it again with your reproduction steps.

@shwstppr
Copy link
Copy Markdown
Contributor

shwstppr commented Nov 8, 2023

@harikrishna-patnala will you be able to make changes for cases reported by @kiranchavala ?

@DaanHoogland
Copy link
Copy Markdown
Contributor

@harikrishna-patnala will you be able to make changes for cases reported by @kiranchavala ?

and if you can't, is it worth merging without?/will you create a new issue for it?

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

checking it now, I'll see if that case can be covered here.

@DaanHoogland
Copy link
Copy Markdown
Contributor

@harikrishna-patnala any progress/prognosis?

@shwstppr
Copy link
Copy Markdown
Contributor

shwstppr commented Dec 5, 2023

@harikrishna-patnala any update on this?

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

Sorry @DaanHoogland and @shwstppr could not finish this completely. Resuming it now.

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

I'm seeing some other cases where the cluster is stuck in "Scaling" state. Trying to fix them as well.

@Override
public KubernetesClusterVO doInTransaction(TransactionStatus status) {
KubernetesClusterVO updatedCluster = kubernetesClusterDao.createForUpdate(kubernetesCluster.getId());
KubernetesClusterVO updatedCluster = kubernetesClusterDao.findById(kubernetesCluster.getId());
Copy link
Copy Markdown
Member Author

@harikrishna-patnala harikrishna-patnala Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bigger change because, I observed createForUpdate() is returning an object with all null or default values and assumption is that we have to update that entry for all the columns.

This is causing few issues with scaling operations

  1. When I tried to change the node count and change the compute offering at the same time. Compute offering change on few nodes is missing. This happened because of above code
  2. Above issue is causing after effects where I could not change the compute offering of the cluster anymore as there are differences in the compute offerings of the nodes
  3. The state issue, causing NPE (the actual bug raised here)

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

This is ready for review cc @kiranchavala @shwstppr @DaanHoogland

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 7, 2023

Codecov Report

Attention: 900 lines in your changes are missing coverage. Please review.

Comparison is base (29c7b31) 13.02% compared to head (ccf9d1e) 13.11%.
Report is 115 commits behind head on 4.18.

Files Patch % Lines
...n/java/com/cloud/network/IpAddressManagerImpl.java 11.01% 105 Missing ⚠️
...ava/com/cloud/upgrade/dao/Upgrade41800to41810.java 1.05% 94 Missing ⚠️
...java/com/cloud/agent/manager/AgentManagerImpl.java 0.00% 66 Missing ⚠️
...src/main/java/com/cloud/upgrade/GuestOsMapper.java 20.73% 65 Missing ⚠️
...ud/hypervisor/kvm/storage/KVMStorageProcessor.java 0.00% 47 Missing and 1 partial ⚠️
...ain/java/com/cloud/api/query/QueryManagerImpl.java 0.00% 35 Missing ⚠️
...ain/java/com/cloud/storage/dao/GuestOSDaoImpl.java 0.00% 28 Missing ⚠️
...in/java/com/cloud/server/ManagementServerImpl.java 0.00% 24 Missing ⚠️
.../apache/cloudstack/vm/UnmanagedVMsManagerImpl.java 48.93% 19 Missing and 5 partials ⚠️
...va/org/apache/cloudstack/ldap/LdapManagerImpl.java 0.00% 20 Missing ⚠️
... and 59 more
Additional details and impacted files
@@             Coverage Diff              @@
##               4.18    #8107      +/-   ##
============================================
+ Coverage     13.02%   13.11%   +0.09%     
- Complexity     9032     9134     +102     
============================================
  Files          2720     2720              
  Lines        257080   257674     +594     
  Branches      40088    40174      +86     
============================================
+ Hits          33476    33803     +327     
- Misses       219400   219579     +179     
- Partials       4204     4292      +88     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7962

@DaanHoogland
Copy link
Copy Markdown
Contributor

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Copy link
Copy Markdown
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments. Will need testing

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-8499)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 40740 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8107-t8499-kvm-centos7.zip
Smoke tests completed. 109 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@shwstppr
Copy link
Copy Markdown
Contributor

shwstppr commented Dec 8, 2023

@blueorangutan package

1 similar comment
@harikrishna-patnala
Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7986

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@harikrishna-patnala a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-8534)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 40913 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8107-t8534-kvm-centos7.zip
Smoke tests completed. 108 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_08_migrate_vm Error 45.83 test_vm_life_cycle.py

@harikrishna-patnala
Copy link
Copy Markdown
Member Author

@kiranchavala would you like to continue testing this and also please verify the issue that you've raised

Copy link
Copy Markdown
Member

@kiranchavala kiranchavala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tested the fix by @harikrishna-patnala and it is working fine

Copy link
Copy Markdown
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM

@shwstppr shwstppr merged commit 3ce7c39 into apache:4.18 Dec 12, 2023
@shwstppr shwstppr deleted the HandleScaleK8sError branch December 12, 2023 11:27
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants