test: Add e2e tests for GPU ECC NPD health checks#8115
Open
ganeshkumarashok wants to merge 1 commit into main
Adds AgentBaker e2e tests for the new GPU ECC Node Problem Detector plugin (check_gpu_ecc.sh / custom-plugin-gpu-ecc.json), covering:
- ValidateNPDGPUECCPlugin: verifies the plugin config file is present
- ValidateNPDGPUECCCondition: polls the GPUECCError node condition, expects GPUECCErrorIsNotPresent / ConditionFalse when hardware is healthy
- ValidateNPDGPUECCConditionAfterFailure: injects a fault (replaces the plugin script with one that exits 1 with NHC2019 output), waits for NPD to report GPUECCErrorIsPresent / ConditionTrue, then restores the original script

Tests run as part of runScenarioGPUNPD on ECC-capable SKUs:
- Standard_ND96isr_H100_v5 (UAE North)
- Standard_ND96asr_v4 (South Central US)
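The polling behavior described for ValidateNPDGPUECCCondition can be illustrated with a minimal sketch. This is not the PR's implementation: `NodeCondition`, `matches`, `waitForCondition`, and `demo` are hypothetical names, the struct is a simplified stand-in for `corev1.NodeCondition`, and the fetch function fakes the API server so the example runs without a cluster.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// NodeCondition is a hypothetical, simplified stand-in for corev1.NodeCondition.
type NodeCondition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// matches reports whether got has the wanted type, status, and reason,
// and whether its message contains want.Message (an empty string always matches).
func matches(got, want NodeCondition) bool {
	return got.Type == want.Type &&
		got.Status == want.Status &&
		got.Reason == want.Reason &&
		strings.Contains(got.Message, want.Message)
}

// waitForCondition calls fetch every interval until a matching condition
// appears or ctx is done. Fetch errors are treated as transient and retried.
func waitForCondition(ctx context.Context, fetch func() ([]NodeCondition, error),
	want NodeCondition, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		conds, err := fetch()
		if err == nil {
			for _, c := range conds {
				if matches(c, want) {
					return nil
				}
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("condition %s/%s not observed: %w", want.Type, want.Reason, ctx.Err())
		case <-ticker.C:
		}
	}
}

// demo simulates NPD publishing the healthy condition on the third poll.
func demo() error {
	calls := 0
	fetch := func() ([]NodeCondition, error) {
		calls++
		if calls < 3 {
			return nil, errors.New("condition not published yet")
		}
		return []NodeCondition{{
			Type: "GPUECCError", Status: "False", Reason: "GPUECCErrorIsNotPresent",
		}}, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	want := NodeCondition{Type: "GPUECCError", Status: "False", Reason: "GPUECCErrorIsNotPresent"}
	return waitForCondition(ctx, fetch, want, 10*time.Millisecond)
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("healthy condition observed")
}
```

Bounding the wait with a context deadline keeps a node that never publishes the condition from hanging the scenario; the timeout and interval used here are placeholders, not values taken from the PR.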
Contributor
PR Title Lint Failed ❌
Current Title:
Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:
Conventional Commits Format:
Guidelines:
Examples:
Please update your PR title and the lint check will run again automatically.
Contributor
Pull request overview
Adds new e2e validators to exercise the GPU ECC Node Problem Detector (NPD) custom plugin on ECC-capable GPU SKUs, including validating the plugin config, the healthy condition state, and a fault-injection path.
Changes:
- Added validators to confirm the GPU ECC NPD plugin config file exists and that the healthy `GPUECCError` condition is reported.
- Added a fault-injection validator that temporarily replaces the ECC check script to force NPD to surface an error condition, then restores the original script.
- Wired the new validators into the existing `runScenarioGPUNPD` flow.
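The replace-then-restore flow in the fault-injection validator can be sketched as follows. This is an illustration under stated assumptions rather than the code in e2e/validators.go: it works on a local temporary file instead of a remote node, and `withInjectedFault`, `faultScript`, `errSimulated`, and `demo` are hypothetical names. It shows one way to guarantee the restore step runs even when the check in the middle fails, by putting the restore in a `defer`.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// faultScript mimics a failing check: it prints the fault message and exits 1.
const faultScript = "#!/bin/sh\necho 'GPU DRAM ECC errors detected. FaultCode: NHC2019'\nexit 1\n"

// errSimulated stands in for a validation failure raised while the fault is active.
var errSimulated = errors.New("simulated validation failure")

// withInjectedFault overwrites the file at path with faultScript, runs check,
// and writes the original content back afterwards, even when check fails.
func withInjectedFault(path string, check func() error) (err error) {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat original: %w", err)
	}
	original, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("read original: %w", err)
	}
	defer func() {
		// Restore runs on every return path; a restore error is reported
		// only if nothing else has failed already.
		if rerr := os.WriteFile(path, original, info.Mode().Perm()); rerr != nil && err == nil {
			err = fmt.Errorf("restore original: %w", rerr)
		}
	}()
	if err = os.WriteFile(path, []byte(faultScript), info.Mode().Perm()); err != nil {
		return fmt.Errorf("inject fault: %w", err)
	}
	return check()
}

// demo injects the fault into a temporary script, fails the check on purpose,
// and verifies that the original script is back in place afterwards.
func demo() error {
	dir, err := os.MkdirTemp("", "ecc-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "check_gpu_ecc.sh")
	healthy := "#!/bin/sh\nexit 0\n"
	if err := os.WriteFile(path, []byte(healthy), 0o755); err != nil {
		return err
	}

	checkErr := withInjectedFault(path, func() error {
		got, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		if string(got) != faultScript {
			return errors.New("fault script not in place during check")
		}
		return errSimulated
	})
	if !errors.Is(checkErr, errSimulated) {
		return fmt.Errorf("expected simulated failure, got: %v", checkErr)
	}

	got, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if string(got) != healthy {
		return errors.New("original script was not restored")
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("original script restored after failed check")
}
```

Deferring the restore matters most on shared or reused test nodes, where a leftover fault script would keep the node reporting GPUECCErrorIsPresent for later scenarios.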
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| e2e/validators.go | Adds GPU ECC NPD plugin/condition validators plus a fault-injection/restore flow. |
| e2e/test_helpers.go | Runs the new GPU ECC validators as part of the GPU NPD scenario. |
Comment on lines +1085 to +1089

    validateNPDCondition(ctx, s, "GPUECCError", "GPUECCErrorIsPresent", corev1.ConditionTrue,
        "GPU DRAM ECC errors detected. FaultCode: NHC2019", "expected GPUECCError message to indicate DRAM ECC errors")

    // Restore the original script
    restoreCmd := []string{
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #