
[CI] Refactor CI build on GitHub #2723

Open
ptrendx wants to merge 27 commits into NVIDIA:main from ptrendx:pr_github_build

Conversation

@ptrendx
Member

@ptrendx ptrendx commented Mar 2, 2026

Description

Run the build in a minimal container and use the pip-wheel-based CUDA packages to minimize disk usage.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 24 commits February 24, 2026 15:16
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx marked this pull request as ready for review March 10, 2026 18:02
@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR refactors the GitHub Actions CI build system by replacing the Docker-in-Docker + sccache approach with pre-built minimal containers (using pip-wheel-based CUDA) and native ccache via actions/cache, aiming to reduce disk usage and build complexity. New Dockerfile.pytorch and Dockerfile.all define the pre-built CI images, build_tools/utils.py gains get_cuda_library_dirs() for wheel-based CUDA library discovery, and the NCCL include is restructured across logging.h and cgemm_helper.h.

Key observations:

  • The pytorch and all CI jobs pull containers tagged :latest from a personal GHCR account (ghcr.io/ptrendx/*), making builds non-reproducible and dependent on an account outside the nvidia org.
  • CUDA_PATH is hardcoded to the cu13 subdirectory in both build.yml and the Dockerfiles; a future CUDA version bump will require a manual update to avoid a silent breakage.
  • get_cuda_library_dirs() carries an incorrect -> Tuple[str, str] return-type annotation (it returns List[Path]), copied from get_cuda_include_dirs.
  • Moving #include "nccl.h" inside the NVTE_WITH_CUBLASMP guard in logging.h while leaving NVTE_CHECK_NCCL defined outside it creates a latent compilation hazard for any future code that includes logging.h and uses the macro without separately including nccl.h.

Confidence Score: 3/5

  • PR is functional but introduces reproducibility risks in CI and a latent C++ header guard issue worth resolving before merging.
  • The CI refactor is a meaningful improvement in build simplicity and disk usage. However, the use of mutable :latest image tags under a personal GHCR account for the pytorch and all jobs undermines build reproducibility and auditability. The NCCL header restructure in logging.h works today but leaves a trap for future contributors. The incorrect return-type annotation is a minor but concrete error in new code.
  • .github/workflows/build.yml (mutable container tags, hardcoded cu13), transformer_engine/common/util/logging.h (NVTE_CHECK_NCCL outside NVTE_WITH_CUBLASMP guard)

Important Files Changed

  • .github/workflows/build.yml: Major CI refactor that replaces Docker-in-Docker + sccache with pre-built containers + ccache; the pytorch/all jobs now use mutable :latest tags under a personal GHCR account, and CUDA paths are hardcoded to cu13.
  • build_tools/ci/Dockerfile.pytorch: New Dockerfile for the PyTorch CI container; installs pip-wheel CUDA and creates unversioned symlinks for libcudart/libcublas/libnccl, but is missing an unversioned libcudnn symlink (previously raised in the review thread).
  • build_tools/ci/Dockerfile.all: New Dockerfile for the combined PyTorch+JAX CI container; same symlink structure as Dockerfile.pytorch plus JAX installation, with the same missing libcudnn.so symlink concern.
  • build_tools/utils.py: Adds get_cuda_library_dirs() for wheel-based CUDA library discovery and a CUDA 13 package-name fallback in cuda_version(); the new function carries an incorrect Tuple[str, str] return-type annotation.
  • build_tools/jax.py: Wires get_cuda_library_dirs() into the JAX Pybind11 extension so the linker can find NCCL and other CUDA libraries when building against pip wheels; the change is correct and self-contained.
  • transformer_engine/common/util/logging.h: Moves #include "nccl.h" inside the NVTE_WITH_CUBLASMP guard, but the NVTE_CHECK_NCCL macro (which uses NCCL types) remains defined unconditionally outside the guard, creating a latent compilation hazard.
  • transformer_engine/jax/csrc/extensions/cgemm_helper.h: Adds #include <nccl.h> to compensate for nccl.h no longer being unconditionally included via logging.h; keeps the only NCCL macro user (cgemm_helper.cpp) compiling correctly.

Sequence Diagram

sequenceDiagram
    participant GH as GitHub Actions Runner
    participant Cache as actions/cache (ccache)
    participant Container as Pre-built Container<br/>(ghcr.io/ptrendx/*)
    participant Pip as pip install
    participant CUDAWheels as NVIDIA CUDA Wheels<br/>(nvidia-cuda-nvcc, etc.)

    GH->>Container: Pull image (pytorch / all jobs)
    Container-->>GH: Container environment with<br/>torch + CUDA wheel libs + ccache

    GH->>Cache: Restore /root/.ccache
    Cache-->>GH: Cached compiler objects (if any)

    GH->>Pip: pip install --no-build-isolation .<br/>NVTE_BUILD_USE_NVIDIA_WHEELS=1
    Pip->>CUDAWheels: get_cuda_library_dirs() → nvidia/*/lib
    CUDAWheels-->>Pip: Library paths (libcudart, libcublas, libnccl …)
    Pip-->>GH: Built transformer_engine wheel

    GH->>Cache: Save updated /root/.ccache

    GH->>GH: python3 tests/*/test_sanity_import.py

Comments Outside Diff (1)

  1. transformer_engine/common/util/logging.h, lines 109–115

    NVTE_CHECK_NCCL defined outside NVTE_WITH_CUBLASMP guard despite nccl.h being moved inside it

    After this PR, #include "nccl.h" only appears inside #ifdef NVTE_WITH_CUBLASMP ... #endif (lines 15–19). However, the NVTE_CHECK_NCCL macro, which expands to code using ncclResult_t, ncclSuccess, and ncclGetErrorString, is still defined unconditionally outside that guard.

    This currently works only because the sole caller (cgemm_helper.cpp) pulls in nccl.h via cgemm_helper.h before it includes logging.h. Any future file that includes logging.h and expands NVTE_CHECK_NCCL without also including nccl.h first will fail to compile with undefined NCCL types.

    To make the API self-contained and prevent future breakage, consider moving the NVTE_CHECK_NCCL macro inside the NVTE_WITH_CUBLASMP guard (matching where nccl.h now lives), or restoring the unconditional #include "nccl.h" and keeping cgemm_helper.h's additional include as an explicit dependency of that header only.

Last reviewed commit: f3d0aa6

ptrendx added 2 commits March 10, 2026 21:42
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Comment on lines +22 to +28

# Create symlinks for CUDA libraries
RUN CUDA_PATH=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cu13')") && \
    ln -s $CUDA_PATH/lib/libcudart.so.13 $CUDA_PATH/lib/libcudart.so && \
    ln -s $CUDA_PATH/lib/libcublas.so.13 $CUDA_PATH/lib/libcublas.so && \
    ln -s $CUDA_PATH/../nccl/lib/libnccl.so.2 $CUDA_PATH/../nccl/lib/libnccl.so && \
    ln -s $CUDA_PATH/lib $CUDA_PATH/lib64
Contributor


Missing libcudnn.so unversioned symlink

The symlink block creates unversioned aliases for libcudart, libcublas, and libnccl (needed for -lcudart, -lcublas, -lnccl at link time), but no equivalent symlink for libcudnn.so. The logging.h header unconditionally includes <cudnn.h>, and the build links against cudnn. If the cudnn pip wheel (typically pulled in as a torch dependency) only ships a versioned filename such as libcudnn.so.9, the linker will fail to resolve -lcudnn without an unversioned symlink.

Consider adding a similar symlink for cudnn, e.g.:

CUDNN_LIB=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cudnn/lib')") && \
    ln -s $CUDNN_LIB/libcudnn.so.9 $CUDNN_LIB/libcudnn.so

The same applies to Dockerfile.all.
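The unversioned-alias step can also be sketched generically in Python; this is a hedged illustration (the filenames and the `ensure_unversioned_symlink` helper are assumptions for demonstration, not code from the PR):

```python
from pathlib import Path


def ensure_unversioned_symlink(versioned: Path) -> Path:
    """Create e.g. libcudnn.so -> libcudnn.so.9 next to the versioned file."""
    name = versioned.name
    # Keep everything up to and including ".so": "libcudnn.so.9" -> "libcudnn.so"
    base = name[: name.index(".so") + 3]
    link = versioned.parent / base
    if not link.exists():
        # Relative target, so the link stays valid if the tree is copied or bind-mounted.
        link.symlink_to(versioned.name)
    return link
```

Applied to each versioned library shipped by the wheels, this would cover libcudnn the same way the existing RUN block covers libcudart, libcublas, and libnccl.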

name: 'PyTorch'
runs-on: ubuntu-latest
container:
  image: ghcr.io/ptrendx/te_gha_pytorch:latest
Member


Need to change to use GH packages

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Comment on lines 54 to 55
steps:
  - name: Move /var/lib/docker/
    shell: bash -euxo pipefail {0}
    run: sudo mv /var/lib/docker/ "${GITHUB_WORKSPACE}/docker"

  - name: Maximize build space
    uses: easimon/maximize-build-space@c28619d8999a147d5e09c1199f84ff6af6ad5794
    with:
      root-reserve-mb: 5120
      temp-reserve-mb: 32
      swap-size-mb: 10240
      remove-dotnet: 'true'
      remove-android: 'true'
      remove-haskell: 'true'
      remove-codeql: 'true'
      build-mount-path: '/var/lib/docker/'

  - name: Restore /var/lib/docker/
    shell: bash -euxo pipefail {0}
    run: sudo sh -c "mv ${GITHUB_WORKSPACE}/docker/* /var/lib/docker"

  - name: 'Checkout'
Contributor


Mutable :latest container tags break reproducibility

Both the pytorch and all jobs pull containers by their :latest tag (ghcr.io/ptrendx/te_gha_pytorch:latest and ghcr.io/ptrendx/te_gha_all:latest). Mutable tags mean:

  1. The CI build can silently change behavior whenever the image is updated, making it very difficult to bisect regressions.
  2. The images are hosted under a personal account (ptrendx) rather than the official nvidia org on GHCR — if account permissions change, CI breaks with no notice.

Consider pinning to an immutable digest, e.g.:

image: ghcr.io/ptrendx/te_gha_pytorch@sha256:<digest>

Or use a date-versioned tag (te_gha_pytorch:2026-03-16) that is regenerated on image updates, keeping reproducibility within a given CI run while still making updates explicit.

The same applies to the all job container at line 120.



@functools.lru_cache(maxsize=None)
def get_cuda_library_dirs() -> Tuple[str, str]:
Contributor


Incorrect return type annotation (copy-paste from get_cuda_include_dirs)

The return type is annotated as Tuple[str, str], but the function returns either an empty list or a List[Path] (via the list comprehension on lines 274–278). This annotation is wrong and was likely copied from get_cuda_include_dirs.

Suggested change
def get_cuda_library_dirs() -> Tuple[str, str]:
def get_cuda_library_dirs() -> List[Path]:
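For context, a minimal sketch of a helper with the corrected annotation; the body here is illustrative, assuming the nvidia/*/lib wheel layout mentioned in the summary, and may differ from the PR's actual implementation:

```python
import functools
from pathlib import Path
from typing import List


@functools.lru_cache(maxsize=None)
def get_cuda_library_dirs() -> List[Path]:
    """Return lib/ directories from pip-installed NVIDIA CUDA wheels, or [] if absent."""
    try:
        import nvidia  # namespace package shipped by the CUDA pip wheels
    except ImportError:
        return []
    root = Path(list(nvidia.__path__)[0])
    # Each wheel (cublas, cudnn, nccl, ...) places its shared objects in <pkg>/lib.
    return sorted(p for p in root.glob("*/lib") if p.is_dir())
```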

Comment on lines +60 to +63
run: df -lh
- uses: actions/cache@v4
  with:
    path: /root/.ccache
Contributor


Hardcoded cu13 CUDA path prefix

The CUDA_PATH and CUDNN_PATH exports hardcode the cu13 subdirectory name, which is tied to a specific CUDA major version. When CUDA 14 (or another version) is introduced, this will need a manual update to match the new cu14 subdirectory — making the failure mode invisible until the build breaks.

Consider deriving the path dynamically:

export CUDA_PATH=$(python3 -c "import nvidia, glob, os; paths=glob.glob(os.path.join(list(nvidia.__path__)[0], 'cu[0-9]*')); print(max(paths, key=lambda p: int(os.path.basename(p)[2:])))")

Or document the intentional coupling to CUDA 13 with a comment so the future update is obvious.
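If the path is derived dynamically, the selection logic could look like the sketch below (the `latest_cuda_subdir` helper and directory layout are assumptions). Two details matter: a plain lexicographic sorted(...)[-1] would rank cu9 above cu13, and a bare cu* glob would also match the cudnn directory:

```python
import glob
import os


def latest_cuda_subdir(nvidia_root: str) -> str:
    """Pick the highest-versioned cu* subdirectory under nvidia_root, e.g. cu13 over cu9."""
    # cu[0-9]* avoids matching non-version siblings such as nvidia/cudnn.
    paths = glob.glob(os.path.join(nvidia_root, "cu[0-9]*"))
    if not paths:
        raise FileNotFoundError(f"no cu* directory under {nvidia_root}")
    # Numeric key: lexicographic ordering would put 'cu9' after 'cu13'.
    return max(paths, key=lambda p: int(os.path.basename(p)[2:]))
```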
