Fix missing activation checkpointing (recompute) parameters in bridge mode by XJL010622 · Pull Request #1833 · THUDM/slime

XJL010622 · 2026-04-14T03:06:21Z

Motivation

When using megatron_to_hf_mode == "bridge", the AutoBridge.from_hf_pretrained() method generates a model provider based strictly on the HuggingFace config.json. However, HF configurations only define the static model architecture and do not contain training-specific memory optimization arguments such as activation checkpointing (recompute).

Consequently, critical arguments like recompute_granularity never reach the provider during initialization. Activation checkpointing is then silently left disabled, leading to unexpected and severe OOM (out-of-memory) errors during training, especially for large models or long context windows.

Modifications

This PR explicitly synchronizes the recompute-related parameters from the command-line args to the provider before provider.finalize() is called.

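The loop copies a field only when it exists on args (checked with hasattr) and its value is not None. Launch scripts that omit some recompute arguments therefore keep working, and the provider's existing values are left untouched for any field that is not set.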
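
The guarded copy described above can be tried in isolation. The following is a minimal sketch, not the PR's code: `sync_recompute_fields` is a hypothetical helper name, and `SimpleNamespace` objects stand in for the real `args` and `provider`, which in slime come from the argument parser and the bridge. The values "full" and "uniform" are illustrative.

```python
from types import SimpleNamespace

# Field names taken from the PR; everything else here is a stand-in.
RECOMPUTE_FIELDS = (
    "recompute_granularity",
    "recompute_method",
    "recompute_num_layers",
)


def sync_recompute_fields(args, provider, fields=RECOMPUTE_FIELDS):
    """Copy each field from args to provider when it is present and not None.

    Returns the names of the fields that were copied (hypothetical helper).
    """
    copied = []
    for field in fields:
        # getattr with a default covers both "attribute missing" and "set to None".
        value = getattr(args, field, None)
        if value is not None:
            setattr(provider, field, value)
            copied.append(field)
    return copied


# Stand-in for parsed command-line args: one field is None.
args = SimpleNamespace(
    recompute_granularity="full",
    recompute_method="uniform",
    recompute_num_layers=None,
)
# Stand-in for a provider built only from an HF config.
provider = SimpleNamespace(
    recompute_granularity=None,
    recompute_method=None,
    recompute_num_layers=None,
)

copied = sync_recompute_fields(args, provider)

# An args object lacking the attributes entirely leaves the provider unchanged.
bare_provider = SimpleNamespace(recompute_granularity=None)
bare_copied = sync_recompute_fields(SimpleNamespace(), bare_provider)
```

Using `getattr(args, field, None)` folds the `hasattr` check and the `None` check into one lookup; the behavior matches the two-step condition in the PR.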

Changed Code snippet (for review)

In get_model_provider_func (inside the bridge conditional branch):

        provider.variable_seq_lengths = args.variable_seq_lengths
        if hasattr(args, "moe_token_dispatcher_type"):
            provider.moe_token_dispatcher_type = args.moe_token_dispatcher_type

        # --- NEW CODE ADDED HERE ---
        # Explicitly sync activation checkpointing parameters since HF config does not contain them
        recompute_fields = (
            "recompute_granularity",
            "recompute_method",
            "recompute_num_layers"
        )
        for field in recompute_fields:
            if hasattr(args, field) and getattr(args, field) is not None:
                setattr(provider, field, getattr(args, field))
        # ---------------------------

        if getattr(args, "decoder_first_pipeline_num_layers", None) is not None:

unknown and others added 2 commits April 14, 2026 11:01

Fix missing activation checkpointing parameters in bridge mode

847e611

Merge branch 'main' into fix-bridge-recompute

f4851fe

XJL010622 wants to merge 2 commits into THUDM:main from XJL010622:fix-bridge-recompute
