Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 15 additions & 2 deletions superbench/common/utils/device_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -389,10 +389,17 @@ def get_device_power(self, idx):
"""
try:
power_measure = rocml.amdsmi_get_power_info(self._device_handlers[idx])
# amdsmi sets fields to 'N/A' when the hardware reports 0xFFFF (unsupported).
# On MI300X, average_socket_power is unsupported, so fall back to current_socket_power.
power = power_measure.get('average_socket_power')
if not isinstance(power, (int, float)):
power = power_measure.get('current_socket_power')
if not isinstance(power, (int, float)):
return None
Comment on lines +394 to +398
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new numeric validation (isinstance(..., (int, float))) is stricter than the previous int(...) cast and will reject many valid numeric types (e.g., numpy scalars, ctypes ints) even though int(value) would work. Consider using numbers.Real/numbers.Number for the type check, or just attempt int(power) in a small try/except and only fall back / return None on conversion failure.

Copilot uses AI. Check for mistakes.
return int(power)
Comment on lines +394 to +399
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using dict.get() here changes behavior vs the prior code: if the expected key is missing, the function now returns None without logging a warning (because no exception is raised). If missing keys indicate an amdsmi schema/version mismatch, consider explicitly checking for key presence / unexpected values and logging a warning before returning None so monitoring failures are diagnosable.

Copilot uses AI. Check for mistakes.
Comment on lines +392 to +399
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new fallback behavior for ROCm power fields (handling 'N/A' and falling back to current_socket_power) is currently untested. Please add a unit test that mocks amdsmi_get_power_info to cover: (1) average_socket_power='N/A' falling back, and (2) both fields unsupported returning None.

Copilot uses AI. Check for mistakes.
except Exception as err:
logger.warning('Get device power failed: {}'.format(str(err)))
return None
return int(power_measure['average_socket_power'])

def get_device_power_limit(self, idx):
"""Get the power management limit of device, unit: watt.
Expand All @@ -405,10 +412,16 @@ def get_device_power_limit(self, idx):
"""
try:
power_measure = rocml.amdsmi_get_power_info(self._device_handlers[idx])
power_limit = power_measure.get('power_limit')
if not isinstance(power_limit, (int, float)):
return None
# amdsmi returns power_limit in microwatts (e.g. 750000000 for 750W), convert to watts.
if power_limit > 100000:
power_limit = power_limit // 1000000
Comment on lines +418 to +420
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The microwatts-to-watts conversion heuristic uses an unexplained magic threshold (> 100000). This is hard to reason about and could silently mis-handle other unit conventions. Prefer a named constant (e.g., MICROWATTS_PER_WATT = 1_000_000) and a clearer conversion rule (always convert if amdsmi is known to return µW, or detect by comparing against MICROWATTS_PER_WATT).

Copilot uses AI. Check for mistakes.
return int(power_limit)
Comment on lines +415 to +421
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The µW→W normalization for power_limit should be covered by a unit test (mock amdsmi_get_power_info) to ensure values like 750000000 are converted to 750 and non-numeric/'N/A' values return None.

Copilot uses AI. Check for mistakes.
except Exception as err:
logger.warning('Get device power limit failed: {}'.format(str(err)))
return None
return int(power_measure['power_limit'])

def get_device_memory(self, idx):
"""Get the memory information of device, unit: byte.
Expand Down
Loading