Conversation
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary

This PR refactors the GitHub Actions CI build system by replacing the Docker-in-Docker + …

Key observations:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant GH as GitHub Actions Runner
    participant Cache as actions/cache (ccache)
    participant Container as Pre-built Container<br/>(ghcr.io/ptrendx/*)
    participant Pip as pip install
    participant CUDAWheels as NVIDIA CUDA Wheels<br/>(nvidia-cuda-nvcc, etc.)
    GH->>Container: Pull image (pytorch / all jobs)
    Container-->>GH: Container environment with<br/>torch + CUDA wheel libs + ccache
    GH->>Cache: Restore /root/.ccache
    Cache-->>GH: Cached compiler objects (if any)
    GH->>Pip: pip install --no-build-isolation .<br/>NVTE_BUILD_USE_NVIDIA_WHEELS=1
    Pip->>CUDAWheels: get_cuda_library_dirs() → nvidia/*/lib
    CUDAWheels-->>Pip: Library paths (libcudart, libcublas, libnccl …)
    Pip-->>GH: Built transformer_engine wheel
    GH->>Cache: Save updated /root/.ccache
    GH->>GH: python3 tests/*/test_sanity_import.py
```
```dockerfile
# Create symlinks for CUDA libraries
RUN CUDA_PATH=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cu13')") && \
    ln -s $CUDA_PATH/lib/libcudart.so.13 $CUDA_PATH/lib/libcudart.so && \
    ln -s $CUDA_PATH/lib/libcublas.so.13 $CUDA_PATH/lib/libcublas.so && \
    ln -s $CUDA_PATH/../nccl/lib/libnccl.so.2 $CUDA_PATH/../nccl/lib/libnccl.so && \
    ln -s $CUDA_PATH/lib $CUDA_PATH/lib64
```
Missing libcudnn.so unversioned symlink
The symlink block creates unversioned aliases for libcudart, libcublas, and libnccl (needed for -lcudart, -lcublas, -lnccl at link time), but no equivalent symlink for libcudnn.so. The logging.h header unconditionally includes <cudnn.h>, and the build links against cudnn. If the cudnn pip wheel (typically pulled in as a torch dependency) only ships a versioned filename such as libcudnn.so.9, the linker will fail to resolve -lcudnn without an unversioned symlink.
Consider adding a similar symlink for cudnn, e.g.:

```shell
CUDNN_LIB=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cudnn/lib')") && \
    ln -s $CUDNN_LIB/libcudnn.so.9 $CUDNN_LIB/libcudnn.so
```

The same applies to Dockerfile.all.
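The reviewer's point generalizes: every library named with `-l` at link time needs an unversioned alias. A small Python sketch of that aliasing logic (a hypothetical helper, not code from the PR), demonstrated on a temporary directory standing in for the wheel's `lib/` folder:

```python
from pathlib import Path
import re
import tempfile

def add_unversioned_aliases(libdir: Path) -> list:
    """For every libfoo.so.N in libdir, create a libfoo.so -> libfoo.so.N symlink."""
    created = []
    for lib in libdir.glob("lib*.so.*"):
        m = re.match(r"(lib\w+\.so)\.\d+$", lib.name)
        if m:
            alias = libdir / m.group(1)
            if not alias.exists():
                alias.symlink_to(lib.name)  # relative link, survives relocation
                created.append(alias.name)
    return sorted(created)

# Demo: the three versioned libraries from the review discussion.
with tempfile.TemporaryDirectory() as d:
    libdir = Path(d)
    for name in ["libcudart.so.13", "libcublas.so.13", "libcudnn.so.9"]:
        (libdir / name).touch()
    print(add_unversioned_aliases(libdir))
    # → ['libcublas.so', 'libcudart.so', 'libcudnn.so']
```

Unlike the hand-written `ln -s` list, this picks up any new versioned library automatically, so a future wheel bump cannot silently miss one.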
```yaml
name: 'PyTorch'
runs-on: ubuntu-latest
container:
  image: ghcr.io/ptrendx/te_gha_pytorch:latest
```
Need to change to use GH packages
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
```yaml
steps:
  - name: Move /var/lib/docker/
    shell: bash -euxo pipefail {0}
    run: sudo mv /var/lib/docker/ "${GITHUB_WORKSPACE}/docker"

  - name: Maximize build space
    uses: easimon/maximize-build-space@c28619d8999a147d5e09c1199f84ff6af6ad5794
    with:
      root-reserve-mb: 5120
      temp-reserve-mb: 32
      swap-size-mb: 10240
      remove-dotnet: 'true'
      remove-android: 'true'
      remove-haskell: 'true'
      remove-codeql: 'true'
      build-mount-path: '/var/lib/docker/'

  - name: Restore /var/lib/docker/
    shell: bash -euxo pipefail {0}
    run: sudo sh -c "mv ${GITHUB_WORKSPACE}/docker/* /var/lib/docker"

  - name: 'Checkout'
```
Mutable :latest container tags break reproducibility
Both the pytorch and all jobs pull containers by their :latest tag (ghcr.io/ptrendx/te_gha_pytorch:latest and ghcr.io/ptrendx/te_gha_all:latest). Mutable tags mean:
- The CI build can silently change behavior whenever the image is updated, making it very difficult to bisect regressions.
- The images are hosted under a personal account (ptrendx) rather than the official nvidia org on GHCR; if account permissions change, CI breaks with no notice.
Consider pinning to an immutable digest, e.g.:

```yaml
image: ghcr.io/ptrendx/te_gha_pytorch@sha256:<digest>
```
Or use a date-versioned tag (te_gha_pytorch:2026-03-16) that is regenerated on image updates, keeping reproducibility within a given CI run while still making updates explicit.
The same applies to the all job container at line 120.
```python
@functools.lru_cache(maxsize=None)
def get_cuda_library_dirs() -> Tuple[str, str]:
```
Incorrect return type annotation (copy-paste from get_cuda_include_dirs)
The return type is annotated as Tuple[str, str] but the function returns either an empty list or a List[Path] (via the list comprehension on line 274–278). This annotation is wrong and was likely copied from get_cuda_include_dirs.
```diff
-def get_cuda_library_dirs() -> Tuple[str, str]:
+def get_cuda_library_dirs() -> List[Path]:
```
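With the corrected annotation, the function plausibly looks like the sketch below, reconstructed from the diagram's `get_cuda_library_dirs() → nvidia/*/lib` step; the body is assumed, not copied from the PR:

```python
import functools
from pathlib import Path
from typing import List

@functools.lru_cache(maxsize=None)
def get_cuda_library_dirs() -> List[Path]:
    """Collect the lib/ directories shipped by the nvidia-* pip wheels (sketch)."""
    try:
        import nvidia  # namespace package created by the nvidia-* wheels
    except ImportError:
        return []  # no CUDA wheels installed: nothing to add to the link path
    dirs: List[Path] = []
    for base in map(Path, nvidia.__path__):
        # Matches nvidia/cu13/lib, nvidia/cudnn/lib, nvidia/nccl/lib, ...
        dirs.extend(p for p in base.glob("*/lib") if p.is_dir())
    return dirs
```

Returning an empty list when the wheels are absent keeps the caller's `-L` flag construction trivial, and `lru_cache` means the filesystem is only scanned once per build.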
```yaml
run: df -lh
- uses: actions/cache@v4
  with:
    path: /root/.ccache
```
Hardcoded cu13 CUDA path prefix
The CUDA_PATH and CUDNN_PATH exports hardcode the cu13 subdirectory name, which is tied to a specific CUDA major version. When CUDA 14 (or another version) is introduced, this will need a manual update to match the new cu14 subdirectory — making the failure mode invisible until the build breaks.
Consider deriving the path dynamically:

```shell
export CUDA_PATH=$(python3 -c "import nvidia, glob, os; paths=glob.glob(os.path.join(list(nvidia.__path__)[0], 'cu*')); print(sorted(paths)[-1])")
```

Or document the intentional coupling to CUDA 13 with a comment so the future update is obvious.
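One caveat with `sorted(paths)[-1]`: lexicographic order ranks `cu9` above `cu13`. A variant that sorts by the numeric suffix avoids that; the sketch below is a hypothetical helper, demonstrated on a temporary directory standing in for the `nvidia` package path:

```python
from pathlib import Path
import tempfile

def latest_cuda_dir(base: Path):
    """Pick the cu* subdirectory with the highest numeric suffix (cu13 beats cu9)."""
    candidates = [p for p in base.glob("cu*") if p.is_dir() and p.name[2:].isdigit()]
    return max(candidates, key=lambda p: int(p.name[2:]), default=None)

# Demo: lexicographic sort would pick cu9 here; numeric sort picks cu13.
with tempfile.TemporaryDirectory() as d:
    base = Path(d)
    for name in ("cu9", "cu12", "cu13"):
        (base / name).mkdir()
    print(latest_cuda_dir(base).name)  # → cu13
```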
Description
Run the build in a minimal container and use the pip-wheel-based CUDA packages to minimize the disk space used.
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: