
[CI] Refactor CI build on GitHub #2723

Open
ptrendx wants to merge 27 commits into NVIDIA:main from ptrendx:pr_github_build

Conversation

@ptrendx
Member

@ptrendx ptrendx commented Mar 2, 2026

Description

Run the build in a minimal container and use the pip-wheel-based CUDA packages to minimize disk usage.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 24 commits February 24, 2026 15:16
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx marked this pull request as ready for review March 10, 2026 18:02
@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR refactors the GitHub Actions CI build system by replacing the Docker-in-Docker + sccache approach with pre-built minimal containers (using pip-wheel-based CUDA) and native ccache via actions/cache, aiming to reduce disk usage and build complexity. New Dockerfile.pytorch and Dockerfile.all define the pre-built CI images, build_tools/utils.py gains get_cuda_library_dirs() for wheel-based CUDA library discovery, and the NCCL include is restructured across logging.h and cgemm_helper.h.

Key observations:

  • The pytorch and all CI jobs pull containers tagged :latest from a personal GHCR account (ghcr.io/ptrendx/*), making builds non-reproducible and dependent on an account outside the nvidia org.
  • CUDA_PATH is hardcoded to the cu13 subdirectory in both build.yml and the Dockerfiles; a future CUDA version bump will require a manual update to avoid a silent breakage.
  • get_cuda_library_dirs() carries an incorrect -> Tuple[str, str] return-type annotation (it returns List[Path]), copied from get_cuda_include_dirs.
  • Moving #include "nccl.h" inside the NVTE_WITH_CUBLASMP guard in logging.h while leaving NVTE_CHECK_NCCL defined outside it creates a latent compilation hazard for any future code that includes logging.h and uses the macro without separately including nccl.h.

Confidence Score: 3/5

  • PR is functional but introduces reproducibility risks in CI and a latent C++ header guard issue worth resolving before merging.
  • The CI refactor is a meaningful improvement in build simplicity and disk usage. However, the use of mutable :latest image tags under a personal GHCR account for the pytorch and all jobs undermines build reproducibility and auditability. The NCCL header restructure in logging.h works today but leaves a trap for future contributors. The incorrect return-type annotation is a minor but concrete error in new code.
  • .github/workflows/build.yml (mutable container tags, hardcoded cu13), transformer_engine/common/util/logging.h (NVTE_CHECK_NCCL outside NVTE_WITH_CUBLASMP guard)

Important Files Changed

  • .github/workflows/build.yml: Major CI refactor that replaces Docker-in-Docker + sccache with pre-built containers + ccache; the pytorch/all jobs now use mutable :latest tags under a personal GHCR account, and CUDA paths are hardcoded to cu13.
  • build_tools/ci/Dockerfile.pytorch: New Dockerfile for the PyTorch CI container; installs pip-wheel CUDA and creates unversioned symlinks for libcudart/libcublas/libnccl, but is missing an unversioned libcudnn symlink (previously raised in the review thread).
  • build_tools/ci/Dockerfile.all: New Dockerfile for the combined PyTorch+JAX CI container; same symlink structure as Dockerfile.pytorch plus JAX installation, with the same missing libcudnn.so symlink concern.
  • build_tools/utils.py: Adds get_cuda_library_dirs() for wheel-based CUDA library discovery and a CUDA 13 package-name fallback in cuda_version(); the new function carries an incorrect Tuple[str, str] return-type annotation.
  • build_tools/jax.py: Wires get_cuda_library_dirs() into the JAX Pybind11 extension so the linker can find NCCL and other CUDA libraries when building against pip wheels; the change is correct and self-contained.
  • transformer_engine/common/util/logging.h: Moves #include "nccl.h" inside the NVTE_WITH_CUBLASMP guard, but the NVTE_CHECK_NCCL macro (which uses NCCL types) remains defined unconditionally outside the guard, creating a latent compilation hazard.
  • transformer_engine/jax/csrc/extensions/cgemm_helper.h: Adds #include <nccl.h> to compensate for nccl.h no longer being unconditionally included via logging.h; keeps the only NCCL macro user (cgemm_helper.cpp) compiling correctly.

Sequence Diagram

sequenceDiagram
    participant GH as GitHub Actions Runner
    participant Cache as actions/cache (ccache)
    participant Container as Pre-built Container<br/>(ghcr.io/ptrendx/*)
    participant Pip as pip install
    participant CUDAWheels as NVIDIA CUDA Wheels<br/>(nvidia-cuda-nvcc, etc.)

    GH->>Container: Pull image (pytorch / all jobs)
    Container-->>GH: Container environment with<br/>torch + CUDA wheel libs + ccache

    GH->>Cache: Restore /root/.ccache
    Cache-->>GH: Cached compiler objects (if any)

    GH->>Pip: pip install --no-build-isolation .<br/>NVTE_BUILD_USE_NVIDIA_WHEELS=1
    Pip->>CUDAWheels: get_cuda_library_dirs() → nvidia/*/lib
    CUDAWheels-->>Pip: Library paths (libcudart, libcublas, libnccl …)
    Pip-->>GH: Built transformer_engine wheel

    GH->>Cache: Save updated /root/.ccache

    GH->>GH: python3 tests/*/test_sanity_import.py

Comments Outside Diff (1)

  1. transformer_engine/common/util/logging.h, lines 109–115

    NVTE_CHECK_NCCL defined outside NVTE_WITH_CUBLASMP guard despite nccl.h being moved inside it

    After this PR, #include "nccl.h" only appears inside #ifdef NVTE_WITH_CUBLASMP ... #endif (lines 15–19). However, the NVTE_CHECK_NCCL macro, which expands to code using ncclResult_t, ncclSuccess, and ncclGetErrorString, is still defined unconditionally outside that guard.

    This currently works only because the sole caller (cgemm_helper.cpp) pulls in nccl.h via cgemm_helper.h before it includes logging.h. Any future file that includes logging.h and expands NVTE_CHECK_NCCL without also including nccl.h first will fail to compile with undefined NCCL types.

    To make the API self-contained and prevent future breakage, consider moving the NVTE_CHECK_NCCL macro inside the NVTE_WITH_CUBLASMP guard (matching where nccl.h now lives), or restoring the unconditional #include "nccl.h" and keeping cgemm_helper.h's additional include as an explicit dependency of that header only.

Last reviewed commit: f3d0aa6

ptrendx added 2 commits March 10, 2026 21:42
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Comment on lines +22 to +28

# Create symlinks for CUDA libraries
RUN CUDA_PATH=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cu13')") && \
    ln -s $CUDA_PATH/lib/libcudart.so.13 $CUDA_PATH/lib/libcudart.so && \
    ln -s $CUDA_PATH/lib/libcublas.so.13 $CUDA_PATH/lib/libcublas.so && \
    ln -s $CUDA_PATH/../nccl/lib/libnccl.so.2 $CUDA_PATH/../nccl/lib/libnccl.so && \
    ln -s $CUDA_PATH/lib $CUDA_PATH/lib64
Contributor


Missing libcudnn.so unversioned symlink

The symlink block creates unversioned aliases for libcudart, libcublas, and libnccl (needed for -lcudart, -lcublas, -lnccl at link time), but no equivalent symlink for libcudnn.so. The logging.h header unconditionally includes <cudnn.h>, and the build links against cudnn. If the cudnn pip wheel (typically pulled in as a torch dependency) only ships a versioned filename such as libcudnn.so.9, the linker will fail to resolve -lcudnn without an unversioned symlink.

Consider adding a similar symlink for cudnn, e.g.:

CUDNN_LIB=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cudnn/lib')") && \
    ln -s $CUDNN_LIB/libcudnn.so.9 $CUDNN_LIB/libcudnn.so

The same applies to Dockerfile.all.
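The unversioned-alias step can also be sketched generically in Python; this is a hedged illustration (the filenames and the `ensure_unversioned_symlink` helper are assumptions for demonstration, not code from the PR):

```python
from pathlib import Path


def ensure_unversioned_symlink(versioned: Path) -> Path:
    """Create e.g. libcudnn.so -> libcudnn.so.9 next to the versioned file."""
    name = versioned.name
    # Keep everything up to and including ".so": "libcudnn.so.9" -> "libcudnn.so"
    base = name[: name.index(".so") + 3]
    link = versioned.parent / base
    if not link.exists():
        # Relative target, so the link stays valid if the tree is copied or bind-mounted.
        link.symlink_to(versioned.name)
    return link
```

Applied to each versioned library shipped by the wheels, this would cover libcudnn the same way the existing RUN block covers libcudart, libcublas, and libnccl.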

name: 'PyTorch'
runs-on: ubuntu-latest
container:
  image: ghcr.io/ptrendx/te_gha_pytorch:latest
Member


Need to change to use GH packages

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Comment on lines 54 to 55
steps:
  - name: Move /var/lib/docker/
    shell: bash -euxo pipefail {0}
    run: sudo mv /var/lib/docker/ "${GITHUB_WORKSPACE}/docker"

  - name: Maximize build space
    uses: easimon/maximize-build-space@c28619d8999a147d5e09c1199f84ff6af6ad5794
    with:
      root-reserve-mb: 5120
      temp-reserve-mb: 32
      swap-size-mb: 10240
      remove-dotnet: 'true'
      remove-android: 'true'
      remove-haskell: 'true'
      remove-codeql: 'true'
      build-mount-path: '/var/lib/docker/'

  - name: Restore /var/lib/docker/
    shell: bash -euxo pipefail {0}
    run: sudo sh -c "mv ${GITHUB_WORKSPACE}/docker/* /var/lib/docker"

  - name: 'Checkout'
Contributor


Mutable :latest container tags break reproducibility

Both the pytorch and all jobs pull containers by their :latest tag (ghcr.io/ptrendx/te_gha_pytorch:latest and ghcr.io/ptrendx/te_gha_all:latest). Mutable tags mean:

  1. The CI build can silently change behavior whenever the image is updated, making it very difficult to bisect regressions.
  2. The images are hosted under a personal account (ptrendx) rather than the official nvidia org on GHCR — if account permissions change, CI breaks with no notice.

Consider pinning to an immutable digest, e.g.:

image: ghcr.io/ptrendx/te_gha_pytorch@sha256:<digest>

Or use a date-versioned tag (te_gha_pytorch:2026-03-16) that is regenerated on image updates, keeping reproducibility within a given CI run while still making updates explicit.

The same applies to the all job container at line 120.



@functools.lru_cache(maxsize=None)
def get_cuda_library_dirs() -> Tuple[str, str]:
Contributor


Incorrect return type annotation (copy-paste from get_cuda_include_dirs)

The return type is annotated as Tuple[str, str], but the function returns either an empty list or a List[Path] (via the list comprehension on lines 274–278). This annotation is wrong and was likely copied from get_cuda_include_dirs.

Suggested change
def get_cuda_library_dirs() -> Tuple[str, str]:
def get_cuda_library_dirs() -> List[Path]:
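For context, a minimal sketch of a helper with the corrected annotation; the body here is illustrative, assuming the nvidia/*/lib wheel layout mentioned in the summary, and may differ from the PR's actual implementation:

```python
import functools
from pathlib import Path
from typing import List


@functools.lru_cache(maxsize=None)
def get_cuda_library_dirs() -> List[Path]:
    """Return lib/ directories from pip-installed NVIDIA CUDA wheels, or [] if absent."""
    try:
        import nvidia  # namespace package shipped by the CUDA pip wheels
    except ImportError:
        return []
    root = Path(list(nvidia.__path__)[0])
    # Each wheel (cublas, cudnn, nccl, ...) places its shared objects in <pkg>/lib.
    return sorted(p for p in root.glob("*/lib") if p.is_dir())
```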

Comment on lines +60 to +63
run: df -lh
- uses: actions/cache@v4
  with:
    path: /root/.ccache
Contributor


Hardcoded cu13 CUDA path prefix

The CUDA_PATH and CUDNN_PATH exports hardcode the cu13 subdirectory name, which is tied to a specific CUDA major version. When CUDA 14 (or another version) is introduced, this will need a manual update to match the new cu14 subdirectory — making the failure mode invisible until the build breaks.

Consider deriving the path dynamically:

export CUDA_PATH=$(python3 -c "import nvidia, glob, os; paths=glob.glob(os.path.join(list(nvidia.__path__)[0], 'cu[0-9]*')); print(max(paths, key=lambda p: int(os.path.basename(p)[2:])))")

Or document the intentional coupling to CUDA 13 with a comment so the future update is obvious.
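If the path is derived dynamically, the selection logic could look like the sketch below (the `latest_cuda_subdir` helper and directory layout are assumptions). Two details matter: a plain lexicographic sorted(...)[-1] would rank cu9 above cu13, and a bare cu* glob would also match the cudnn directory:

```python
import glob
import os


def latest_cuda_subdir(nvidia_root: str) -> str:
    """Pick the highest-versioned cu* subdirectory under nvidia_root, e.g. cu13 over cu9."""
    # cu[0-9]* avoids matching non-version siblings such as nvidia/cudnn.
    paths = glob.glob(os.path.join(nvidia_root, "cu[0-9]*"))
    if not paths:
        raise FileNotFoundError(f"no cu* directory under {nvidia_root}")
    # Numeric key: lexicographic ordering would put 'cu9' after 'cu13'.
    return max(paths, key=lambda p: int(os.path.basename(p)[2:]))
```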
