-
Notifications
You must be signed in to change notification settings - Fork 85
Enhancement - Fix power monitoring for ROCm 6.3 #794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -389,10 +389,17 @@ def get_device_power(self, idx): | |
| """ | ||
| try: | ||
| power_measure = rocml.amdsmi_get_power_info(self._device_handlers[idx]) | ||
| # amdsmi sets fields to 'N/A' when the hardware reports 0xFFFF (unsupported). | ||
| # On MI300X, average_socket_power is unsupported, so fall back to current_socket_power. | ||
| power = power_measure.get('average_socket_power') | ||
| if not isinstance(power, (int, float)): | ||
| power = power_measure.get('current_socket_power') | ||
| if not isinstance(power, (int, float)): | ||
| return None | ||
| return int(power) | ||
|
Comment on lines
+394
to
+399
|
||
| except Exception as err: | ||
| logger.warning('Get device power failed: {}'.format(str(err))) | ||
| return None | ||
| return int(power_measure['average_socket_power']) | ||
|
|
||
| def get_device_power_limit(self, idx): | ||
| """Get the power management limit of device, unit: watt. | ||
|
|
@@ -405,10 +412,16 @@ def get_device_power_limit(self, idx): | |
| """ | ||
| try: | ||
| power_measure = rocml.amdsmi_get_power_info(self._device_handlers[idx]) | ||
| power_limit = power_measure.get('power_limit') | ||
| if not isinstance(power_limit, (int, float)): | ||
| return None | ||
| # amdsmi returns power_limit in microwatts (e.g. 750000000 for 750W), convert to watts. | ||
| if power_limit > 100000: | ||
| power_limit = power_limit // 1000000 | ||
|
Comment on lines
+418
to
+420
|
||
| return int(power_limit) | ||
|
Comment on lines
+415
to
+421
|
||
| except Exception as err: | ||
| logger.warning('Get device power limit failed: {}'.format(str(err))) | ||
| return None | ||
| return int(power_measure['power_limit']) | ||
|
|
||
| def get_device_memory(self, idx): | ||
| """Get the memory information of device, unit: byte. | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new numeric validation (
isinstance(..., (int, float))) is stricter than the previousint(...)cast and will reject many valid numeric types (e.g.,numpyscalars,ctypesints) even thoughint(value)would work. Consider usingnumbers.Real/numbers.Numberfor the type check, or just attemptint(power)in a small try/except and only fall back / return None on conversion failure.