Add MIG partition validation and defaults tests for all instance types #400
Open
Add ml.p6-b300.48xlarge to INSTANCE_TYPE_MIG_PROFILES in constants.py with the correct B300 MIG profiles derived from the NVIDIA GPU Operator v25.3.0 upstream ConfigMap (device-filter 0x318210DE):
- mig-1g.34gb, mig-1g.67gb, mig-2g.67gb
- mig-3g.135gb, mig-4g.135gb, mig-7g.269gb
Also add the corresponding uniform and mixed MIG partition profiles to the Helm chart default-mig-config.yaml ConfigMap, following the same pattern used for existing GPU types (H100, H200, B200).
The B300 GPU (288GB HBM3e, ~269GB usable) was already registered in INSTANCE_RESOURCES but had no MIG profile mapping, causing HyperPod MIG validation to reject accelerator partition requests on this instance type.
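For readers unfamiliar with the mapping, the following is a minimal sketch of what the B300 entry might look like. Only the instance type key and the six profile names come from the commit message; the list-valued dict shape, the union-based derivation of the allowed set, and the `supports_profile` helper are assumptions for illustration, not the actual contents of constants.py.

```python
# Hypothetical sketch: only the B300 key and profile names are taken from the
# commit message. The real constants.py covers every MIG-capable instance type.
INSTANCE_TYPE_MIG_PROFILES = {
    "ml.p6-b300.48xlarge": [
        "mig-1g.34gb",
        "mig-1g.67gb",
        "mig-2g.67gb",
        "mig-3g.135gb",
        "mig-4g.135gb",
        "mig-7g.269gb",
    ],
}

# Assumed derivation: the allowed set is the union of all per-instance lists.
ALLOWED_ACCELERATOR_PARTITION_TYPES = {
    profile
    for profiles in INSTANCE_TYPE_MIG_PROFILES.values()
    for profile in profiles
}


def supports_profile(instance_type: str, profile: str) -> bool:
    """Hypothetical helper: True if the instance type lists the MIG profile."""
    return profile in INSTANCE_TYPE_MIG_PROFILES.get(instance_type, [])
```

Under this sketch, an instance type with no entry in the mapping supports no profiles at all, which is the behavior the commit describes for B300 before this change.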
Covers ml.p6-b300.48xlarge MIG profile support added in PR aws#398:
- Profile presence in INSTANCE_TYPE_MIG_PROFILES
- Complete profile list verification (6 profiles)
- All profiles in ALLOWED_ACCELERATOR_PARTITION_TYPES
- GPU slice extraction for all B300 profiles (1g→1, 2g→2, ..., 7g→7)
- CPU/memory default calculation for each profile at max instances
- Validation acceptance for valid B300 profiles
- Validation rejection for invalid profiles on B300 instance type
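The slice extraction mentioned above (1g→1, ..., 7g→7) can be illustrated with a small regex-based sketch. The function name `extract_gpu_slices`, the pattern, and the error behavior are assumptions for illustration; the repository's actual helper may differ.

```python
import re

# Assumed profile shape: "mig-<slices>g.<memory>gb", e.g. "mig-3g.135gb".
_MIG_PROFILE_RE = re.compile(r"^mig-(\d+)g\.(\d+)gb$")


def extract_gpu_slices(profile: str) -> int:
    """Hypothetical helper: return the GPU slice count encoded in a profile name."""
    match = _MIG_PROFILE_RE.match(profile)
    if match is None:
        raise ValueError(f"Not a MIG profile name: {profile!r}")
    return int(match.group(1))


# The six B300 profiles listed in the commit message.
B300_PROFILES = [
    "mig-1g.34gb", "mig-1g.67gb", "mig-2g.67gb",
    "mig-3g.135gb", "mig-4g.135gb", "mig-7g.269gb",
]
B300_SLICES = [extract_gpu_slices(p) for p in B300_PROFILES]
```

Because the pattern depends only on the profile name, a helper like this is independent of instance type.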
- Delete test_b300_in_instance_type_mig_profiles (subsumed by test_b300_profiles_complete, which raises KeyError on a missing key)
- Delete test_b300_profiles_in_allowed_set (tautological: the allowed set is computed as the union of all profile values)
- Delete test_extract_gpu_slices_b300 (instance-type-agnostic regex already covered by existing parametrized tests)
- Replace > 0 assertions with exact expected values in test_accelerator_partition_defaults_b300
- Fix misleading mock in test_validate_b300_partition: use empty allocatable for the invalid-profile case, since validation fails at the static parameter check before the cluster check
- Remove unused ALLOWED_ACCELERATOR_PARTITION_TYPES import
Eliminate the separate TestB300MigProfiles class. B300 tests now extend the existing parametrized cases in TestAcceleratorPartitionUtil:
- B300 valid/invalid profile cases added to test_validate_accelerator_partition_fields
- B300 defaults with exact values added to test_accelerator_partition_defaults (instance-type-parametrized)
- test_instance_type_profiles_not_empty iterates all instance types in INSTANCE_TYPE_MIG_PROFILES as a data-driven guard
This pattern scales to future instance types without adding new test classes.
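The data-driven pattern described here can be sketched in plain Python as a table of cases plus a guard over the mapping. Everything below is a simplified stand-in: the mapping subset, the `validate_partition` function, its second error message, and the case rows are hypothetical. In the real suite each row would presumably be a `pytest.mark.parametrize` entry on the existing test methods.

```python
# Hypothetical subset of the mapping, using profile names from this PR's commits.
INSTANCE_TYPE_MIG_PROFILES = {
    "ml.p6-b300.48xlarge": ["mig-1g.34gb", "mig-7g.269gb"],
}


def validate_partition(instance_type, profile):
    """Simplified static check (assumed shape); returns (ok, message)."""
    profiles = INSTANCE_TYPE_MIG_PROFILES.get(instance_type)
    if profiles is None:
        return False, "Instance type does not support accelerator partitions"
    if profile not in profiles:
        return False, f"Profile {profile} is not valid for {instance_type}"
    return True, ""


# One row per scenario: (instance_type, profile, expected_ok).
CASES = [
    ("ml.p6-b300.48xlarge", "mig-1g.34gb", True),    # valid profile accepted
    ("ml.p6-b300.48xlarge", "mig-9g.999gb", False),  # unlisted profile rejected
    ("p6-b300.48xlarge", "mig-1g.34gb", False),      # key without ml. prefix
]


def run_cases():
    """Return the rows whose outcome differs from the expectation."""
    return [
        (it, p, expected)
        for it, p, expected in CASES
        if validate_partition(it, p)[0] != expected
    ]


def empty_profile_types():
    """Guard: instance types registered with no profiles at all."""
    return [it for it, ps in INSTANCE_TYPE_MIG_PROFILES.items() if not ps]
```

Adding a new instance type then means adding rows to the table and an entry to the mapping, with no new test class; the guard picks up the new entry automatically.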
Add B200 (Blackwell) test coverage alongside B300:
- 2 validation cases: valid profile accepted, cross-arch rejected
- 6 defaults cases with exact CPU/memory values
B200 validation tests will fail until aws#399 merges (fixes the p6-b200.48xlarge → ml.p6-b200.48xlarge key). B200 defaults tests pass immediately since INSTANCE_RESOURCES already uses the ml. key.
Replace 12 B200/B300-only rows with 1 representative row per MIG-capable instance type (P4d, P4de, P5, P5e, P5en, B200, B300, GB200, G7e). Each row uses the smallest profile at max instance count, verifying that INSTANCE_RESOURCES has correct cpu/gpu/memory values for the ratio calculation.
Summary
Adds regression tests for MIG accelerator partition validation and CPU/memory default calculation across all MIG-capable instance types. All tests extend the existing TestAcceleratorPartitionUtil parametrized cases.
Motivation
Without these tests, the following regressions would only surface as customer-reported runtime failures:
- INSTANCE_TYPE_MIG_PROFILES: the CLI rejects valid MIG requests with "Instance type does not support accelerator partitions". The B200 validation tests in this PR demonstrate this: they fail on the current main branch because the ml. prefix is missing (fixed by "Fix missing ml. prefix and wrong MIG profiles for p6-b200.48xlarge", #399).
- INSTANCE_RESOURCES values: CPU/memory auto-calculation depends on correct instance specs (cpu, gpu, memory). A typo would silently mis-provision pod resources. The defaults test covers all 9 MIG-capable instance types with exact expected values.
Depends on
- ml. prefix for B200 (2 B200 validation tests fail without it)
- INSTANCE_TYPE_MIG_PROFILES and ConfigMap
Merge order: #399 → #398 → this PR
Test coverage
- test_validate_accelerator_partition_fields
- test_accelerator_partition_defaults
- test_instance_type_profiles_not_empty: INSTANCE_TYPE_MIG_PROFILES has ≥1 profile
Test plan