[PyTorch][Fused Attn] Add support for cuDNN to return Softmax Stats always and Max when return_max_logit=True #2677
Conversation
Greptile Summary

This PR adapts TransformerEngine's fused attention implementation to always request the `Stats` tensor from cuDNN, and the `Max` tensor when `return_max_logit=True`. Key changes:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant Py as fused_attn_fwd (Python)
participant Cpp as attention.cpp (C++)
participant CUDA as fused_attn_f16_arbitrary_seqlen.cu
participant cuDNN as cuDNN Frontend
Py->>Cpp: tex.fused_attn_fwd(..., return_max_logit)
Cpp->>Cpp: Allocate aux_tensor_pack<br/>[0]=Stats, [1]=Max(if rml), [n]=rng_state
Cpp->>CUDA: nvte_fused_attn_fwd(... Aux_CTX_Tensors)
CUDA->>CUDA: generate_stats=true (always)
CUDA->>cuDNN: sdpa with set_generate_stats(true)<br/>+ set_logit_max(Max) if return_max_logit
cuDNN-->>CUDA: O, Stats, [Max if return_max_logit]
CUDA-->>Cpp: Aux_CTX_Tensors filled:<br/>[Stats, [Max], rng_state, ...]
Cpp-->>Py: output_tensors=[O, Stats, [Max], rng_state, ...]
Note over Py: if return_max_logit:<br/>aux=[Stats, rng_state,...]<br/>max_logit=amax(Max)<br/>else:<br/>aux=output_tensors[1:]
Py->>Cpp: tex.fused_attn_bwd(..., aux_ctx_tensors=[Stats, rng_state,...])
Cpp->>CUDA: Aux_CTX_Tensors[0]=Stats, [1]=rng_state
CUDA->>cuDNN: sdpaBwd(Stats as softmax_stats)
cuDNN-->>CUDA: dQ, dK, dV
Last reviewed commit: ef0d7ec
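A minimal sketch of the Python-side handling described in the note above, assuming a hypothetical helper name and the `output_tensors` layouts shown in the diagram (the actual wrapper lives in `transformer_engine/pytorch/cpp_extensions/fused_attn.py`):

```python
import torch

def split_fwd_outputs(output_tensors, return_max_logit):
    """Hypothetical helper mirroring the diagram's note (not the actual TE wrapper).

    Assumed layout of output_tensors:
      return_max_logit=True : [O, Stats, Max, rng_state, ...]
      return_max_logit=False: [O, Stats, rng_state, ...]
    """
    out = output_tensors[0]
    if return_max_logit:
        stats, max_tensor = output_tensors[1], output_tensors[2]
        # The backward pass only needs Stats (plus rng_state); Max is not forwarded.
        aux_ctx_tensors = [stats] + output_tensors[3:]
        # Reduce the per-row Max tensor to a single max logit, as in the note above.
        max_logit = torch.amax(max_tensor)
        return out, aux_ctx_tensors, max_logit
    # Without return_max_logit, everything after O goes to the backward pass.
    return out, output_tensors[1:], None
```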
Additional Comments (1)
The public docstring still describes
stats = output_tensors[1] + torch.log(output_tensors[2])
# thd:  output_tensors: out [tq, h, d], Stats [tq, h, 1], Max [tq, h, 1]
# bshd: output_tensors: out [b, sq, h, d], Stats [b, h, sq, 1], Max [b, h, sq, 1]
# sbhd: output_tensors: out [sq, b, h, d], Stats [b, h, sq, 1], Max [b, h, sq, 1] (there's no typo here)
Do we need the "there's no typo here" :)
I deliberately added it because I didn't believe it and checked the shapes myself :P
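For intuition about those shapes, here is a small standalone reference computation (illustrative only, not TE or cuDNN code) for the bshd layout: `Stats` is the row-wise log-sum-exp and `Max` the row-wise maximum of the scaled scores, so both carry one value per query position per head and come out as `[b, h, sq, 1]` even though the inputs are `[b, s, h, d]`:

```python
import torch

b, sq, skv, h, d = 2, 4, 6, 3, 8
scale = d ** -0.5

# bshd-layout inputs: [batch, seqlen, heads, head_dim]
q = torch.randn(b, sq, h, d)
k = torch.randn(b, skv, h, d)

# Attention scores are computed per head: [b, h, sq, skv]
scores = torch.einsum("bqhd,bkhd->bhqk", q, k) * scale

stats = torch.logsumexp(scores, dim=-1, keepdim=True)  # [b, h, sq, 1]
max_ = scores.amax(dim=-1, keepdim=True)                # [b, h, sq, 1]
print(stats.shape, max_.shape)  # torch.Size([2, 3, 4, 1]) torch.Size([2, 3, 4, 1])
```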
/te-ci L2
Description
cuDNN recently made it possible to return any subset of {`Stats`, `SumExp`, `Max`}. This PR adapts TE to always get `Stats` from cuDNN, and the `Max` tensor as well when `return_max_logit=True`. (Note that `Stats = log(SumExp) + Max`.)
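That identity can be sanity-checked in a few lines of plain PyTorch (an illustrative sketch only, not TE code; it assumes `SumExp` is accumulated with the running max already subtracted, as flash-style kernels do):

```python
import torch

logits = torch.randn(16)                  # one row of attention scores
max_ = logits.max()                       # Max
sum_exp = torch.exp(logits - max_).sum()  # SumExp, with Max subtracted for stability
stats = torch.log(sum_exp) + max_         # Stats = log(SumExp) + Max

# Stats is exactly the log-sum-exp of the row
assert torch.allclose(stats, torch.logsumexp(logits, dim=0))
```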
Type of change

Changes

Please list the changes introduced in this PR:
- `fused_attn_f16_arbitrary_seqlen.cu`
  - Drop the `SumExp` tensor, as it's not needed since cuDNN returns `Stats` by default.
  - Set `generate_stats=True`, which forces cuDNN to always return the `Stats` tensor (needed in the backward pass).
- `transformer_engine/pytorch/cpp_extensions/fused_attn.py`
  - Remove the `Stats = log(SumExp) + Max` reconstruction, since cuDNN returns `Stats` directly and TE doesn't need `SumExp` from cuDNN.

Checklist: