add mark_not_offload() interface for cpu_offload_v1 #2770
lhb8125 wants to merge 2 commits into NVIDIA:main from
Conversation
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary
This PR wires up mark_not_offload() for the cpu_offload v1 code path. However, the V1 implementation contains a logic bug that makes it a no-op:
Confidence Score: 2/5
Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["mark_not_offload(*tensors) called\n(cpu_offload.py)"] --> B{NVTE_CPU_OFFLOAD_V1?}
B -- Yes --> C["v1_code_path.mark_not_offload(*tensors)\n(cpu_offload_v1.py)"]
B -- No --> D["prepare_for_saving(*tensors)"]
D --> E["setattr(tensor, '_TE_do_not_offload', True)"]
E --> F["restore_from_saved(tensor_obj, tensors)"]
F --> G["Non-V1 push_tensor checks\n_TE_do_not_offload ✅"]
C --> H["prepare_for_saving(*tensors)"]
H --> I["setattr(tensor, '_TE_do_not_offload', True)"]
I --> J["restore_from_saved(tensor_obj, tensors)"]
J --> K["V1 tensor_need_offloading_checker\nchecks activation_offloading only ❌\n_TE_do_not_offload is never read"]
style K fill:#ffcccc,stroke:#cc0000
style G fill:#ccffcc,stroke:#009900
Last reviewed commit: 730d7a1
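The dispatch the flowchart depicts can be sketched as follows. This is a minimal stand-alone illustration of routing on the NVTE_CPU_OFFLOAD_V1 environment variable (the variable name comes from the diagram; the stub function bodies are placeholders, not the real implementations):

```python
# Sketch of the flowchart's dispatch: mark_not_offload() routes to the V1
# code path when NVTE_CPU_OFFLOAD_V1 is set, else to the non-V1 path.
# The two stubs below are placeholders for illustration only.
import os


def _v1_mark_not_offload(*tensors):
    # Placeholder for v1_code_path.mark_not_offload() in cpu_offload_v1.py
    return "v1"


def _default_mark_not_offload(*tensors):
    # Placeholder for the non-V1 prepare/mark/restore sequence
    return "default"


def mark_not_offload(*tensors):
    if os.getenv("NVTE_CPU_OFFLOAD_V1"):
        return _v1_mark_not_offload(*tensors)
    return _default_mark_not_offload(*tensors)


os.environ.pop("NVTE_CPU_OFFLOAD_V1", None)
print(mark_not_offload())  # default
os.environ["NVTE_CPU_OFFLOAD_V1"] = "1"
print(mark_not_offload())  # v1
```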
def mark_not_offload(*tensors: torch.Tensor):
    """Marks tensors to prevent them from being offloaded."""
    tensors, tensor_obj = prepare_for_saving(*tensors)

    for tensor in tensors:
        if tensor is not None:
            setattr(tensor, "_TE_do_not_offload", True)

    restore_from_saved(tensor_obj, tensors)
_TE_do_not_offload is never checked in V1 path — function is a no-op
The V1 code path decides whether to offload a tensor using tensor_need_offloading_checker_activations, which checks only for hasattr(tensor, "activation_offloading") (see cpu_offload_v1.py line 738–739). The attribute _TE_do_not_offload set here is never consulted in the V1 path, so calling this function after mark_activation_offload will not prevent offloading.
To confirm: the only place _TE_do_not_offload is read in the whole codebase is cpu_offload.py line 463 (OffloadableLayerState._check_if_offload), which belongs to the non-V1 code path.
For mark_not_offload to actually work in V1, the implementation should mirror how mark_activation_offload operates — i.e., remove (or otherwise gate on) the activation_offloading attribute:
def mark_not_offload(*tensors: torch.Tensor):
    """Marks tensors to prevent them from being offloaded."""
    for tensor in tensors:
        if tensor is None:
            continue
        if type(tensor) in [torch.Tensor, torch.nn.Parameter]:
            if hasattr(tensor, "activation_offloading"):
                del tensor.activation_offloading
        else:
            data_tensors = tensor.get_data_tensors()
            for t in data_tensors:
                if t is not None and hasattr(t, "activation_offloading"):
                    del t.activation_offloading

Alternatively, tensor_need_offloading_checker_activations could be updated to also gate on _TE_do_not_offload, but that would be a more invasive change.
/te-ci pytorch L1
Description
This PR adds the mark_not_offload() interface for the cpu_offload v1 version, so that MCore's offloading feature can reuse mark_not_offload() to avoid offloading certain tensors.
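The attribute-marking protocol this PR relies on can be shown with a small self-contained sketch. Plain Python objects stand in for torch.Tensor, and should_offload is a simplified stand-in for the non-V1 check in OffloadableLayerState._check_if_offload described in the review:

```python
# Minimal sketch of the _TE_do_not_offload protocol: mark_not_offload()
# tags tensors, and the (non-V1) offload decision later reads the tag.
# FakeTensor and should_offload are illustrative stand-ins.
def mark_not_offload(*tensors):
    """Mark tensors so the offload checker skips them."""
    for tensor in tensors:
        if tensor is not None:
            setattr(tensor, "_TE_do_not_offload", True)


def should_offload(tensor):
    # Per the review, the non-V1 path treats _TE_do_not_offload as a veto.
    return not getattr(tensor, "_TE_do_not_offload", False)


class FakeTensor:  # stand-in for torch.Tensor
    pass


a, b = FakeTensor(), FakeTensor()
mark_not_offload(b)
print(should_offload(a), should_offload(b))  # True False
```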
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: