test: Add e2e tests for GPU ECC NPD health checks#8115

Open
ganeshkumarashok wants to merge 1 commit into main from aganeshkumar/gpu-ecc-e2e-v2

@ganeshkumarashok
Contributor

Adds AgentBaker e2e tests for the new GPU ECC Node Problem Detector plugin (check_gpu_ecc.sh / custom-plugin-gpu-ecc.json), covering:

  • ValidateNPDGPUECCPlugin: verifies the plugin config file is present
  • ValidateNPDGPUECCCondition: polls GPUECCError node condition, expects GPUECCErrorIsNotPresent / ConditionFalse when hardware is healthy
  • ValidateNPDGPUECCConditionAfterFailure: injects a fault (replaces the plugin script with one that exits 1 with NHC2019 output), waits for NPD to report GPUECCErrorIsPresent / ConditionTrue, then restores

Tests run as part of runScenarioGPUNPD on ECC-capable SKUs:
Standard_ND96isr_H100_v5 (UAE North)
Standard_ND96asr_v4 (South Central US)

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings March 18, 2026 00:48
@ganeshkumarashok ganeshkumarashok changed the title Add e2e tests for GPU ECC NPD health checks test: Add e2e tests for GPU ECC NPD health checks Mar 18, 2026
github-actions bot commented Mar 18, 2026

PR Title Lint Failed ❌

Current Title: test: Add e2e tests for GPU ECC NPD health checks

Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:

Conventional Commits Format:

  • feat: add new feature - for new features
  • fix: resolve bug in component - for bug fixes
  • docs: update README - for documentation changes
  • refactor: improve code structure - for refactoring
  • test: add unit tests - for test additions
  • chore: remove dead code - for maintenance tasks
  • chore(deps): update dependencies - for updating dependencies
  • ci: update build pipeline - for CI/CD changes

Guidelines:

  • Use lowercase for the type and description
  • Keep the description concise but descriptive
  • Use imperative mood (e.g., "add" not "adds" or "added")
  • Don't end with a period

Examples:

Valid:

  • feat(windows): add secure TLS bootstrapping for Windows nodes
  • fix: resolve kubelet certificate rotation issue
  • docs: update installation guide

Invalid:

  • Added new feature
  • Fix bug.
  • Update docs
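
The rules listed above can be approximated with a small checker. This is only a sketch under assumptions: the actual lint workflow is not shown in this thread, `lintTitle` and `titlePattern` are hypothetical names, and the imperative-mood guideline is not checked. Under this reading, a description that starts with an uppercase letter (as in "Add e2e tests ...") would be flagged.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
)

// titlePattern (assumption) matches: type, optional "(scope)", ": ", description.
// The type list is taken from the patterns quoted in the lint comment.
var titlePattern = regexp.MustCompile(`^(feat|fix|docs|refactor|test|chore|ci)(\([a-z0-9-]+\))?: (.+)$`)

// lintTitle returns the list of problems found in a PR title; an empty
// result means the title passes these simplified checks.
func lintTitle(title string) []string {
	m := titlePattern.FindStringSubmatch(title)
	if m == nil {
		return []string{"missing or unknown type prefix"}
	}
	var problems []string
	desc := m[3] // non-empty because the pattern requires at least one character
	if unicode.IsUpper([]rune(desc)[0]) {
		problems = append(problems, "description should start with a lowercase letter")
	}
	if strings.HasSuffix(desc, ".") {
		problems = append(problems, "description should not end with a period")
	}
	return problems
}

func main() {
	titles := []string{
		"test: Add e2e tests for GPU ECC NPD health checks",
		"test: add e2e tests for GPU ECC NPD health checks",
		"Fix bug.",
	}
	for _, t := range titles {
		fmt.Printf("%q -> %v\n", t, lintTitle(t))
	}
}
```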

Please update your PR title and the lint check will run again automatically.

Copilot AI left a comment

Pull request overview

Adds new e2e validators to exercise the GPU ECC Node Problem Detector (NPD) custom plugin on ECC-capable GPU SKUs, including validating the plugin config, the healthy condition state, and a fault-injection path.

Changes:

  • Added validators to confirm the GPU ECC NPD plugin config file exists and that the healthy GPUECCError condition is reported.
  • Added a fault-injection validator that temporarily replaces the ECC check script to force NPD to surface an error condition, then restores the original script.
  • Wired the new validators into the existing runScenarioGPUNPD flow.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| e2e/validators.go | Adds GPU ECC NPD plugin/condition validators plus a fault-injection/restore flow. |
| e2e/test_helpers.go | Runs the new GPU ECC validators as part of the GPU NPD scenario. |

Comment on lines +1085 to +1089
validateNPDCondition(ctx, s, "GPUECCError", "GPUECCErrorIsPresent", corev1.ConditionTrue,
"GPU DRAM ECC errors detected. FaultCode: NHC2019", "expected GPUECCError message to indicate DRAM ECC errors")

// Restore the original script
restoreCmd := []string{