
Add NIXL backend #6016

Open
x41lakazam wants to merge 42 commits into main from dispatch_combine/nixl_backend

Conversation

@x41lakazam
Collaborator

No description provided.

@github-actions

github-actions bot commented Feb 26, 2026

Review updated until commit 7283aa8

Description

  • Add NIXL backend for GPU-to-GPU RDMA transfers in multi-device communication

  • Implement tensor registration, metadata exchange, and transfer preparation/post/wait APIs

  • Add NIXL build option (NVFUSER_BUILD_WITH_NIXL) to CMake and Python build system

  • Include comprehensive tests for transfer handles, validation, and end-to-end transfers

Changes walkthrough

Relevant files

Enhancement (5 files)

  • nixl.h: Define NixlBackend and NixlTransferHandle classes (+221/-0)
  • nixl.cpp: Implement NIXL backend with UCX for GPU transfers (+474/-0)
  • multidevice.h: Add kNixl to CommunicatorBackend enum (+1/-1)
  • communicator.h: Add nixl_available_ flag and backend check (+7/-1)
  • communicator.cpp: Initialize nixl_available_ and add NIXL case to output (+9/-1)

Configuration changes (2 files)

  • CMakeLists.txt: Add NVFUSER_STANDALONE_BUILD_WITH_NIXL option and configuration (+39/-0)
  • utils.py: Add build_with_nixl config and cmake flag (+4/-0)

Documentation (1 file)

  • setup.py: Document NVFUSER_BUILD_WITH_NIXL build option (+3/-0)

Tests (1 file)

  • test_multidevice_nixl.cpp: Add tests for NIXL backend functionality (+289/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Spin loop performance concern

The waitTransfer() function at lines 385-399 uses a busy-wait spin loop to poll transfer status.
This can cause high CPU usage and is suboptimal for latency-sensitive workloads.
Consider whether it should yield to other threads or use a more efficient waiting mechanism.

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
  NVF_ERROR(handle.isValid(), "Cannot wait on an invalid handle");
  NVF_ERROR(handle.impl_->posted, "Transfer has not been posted yet");

  // TODO - check this spin loop
  NixlXferStatus xfer_status;
  do {
    xfer_status = getTransferStatus(handle);
    NVF_ERROR(
        xfer_status != NixlXferStatus::kError,
        "NIXL transfer completed with an error");
  } while (xfer_status == NixlXferStatus::kInProgress);

  handle.impl_->posted = false;
}
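One possible direction, sketched here purely as an illustration (the names `waitWithBackoff`, `XferStatus`, and `poll` are assumptions, not part of the PR): poll with `std::this_thread::yield()` plus a capped exponential backoff, so a long-running transfer does not pin a core at 100%.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical status enum mirroring NixlXferStatus.
enum class XferStatus { kInProgress, kDone, kError };

// Sketch: poll with yield + capped exponential backoff instead of a hot
// spin loop. `poll` stands in for getTransferStatus(handle).
template <typename PollFn>
XferStatus waitWithBackoff(PollFn poll) {
  auto delay = std::chrono::microseconds(1);
  constexpr auto kMaxDelay = std::chrono::microseconds(256);
  XferStatus s;
  while ((s = poll()) == XferStatus::kInProgress) {
    std::this_thread::yield();         // give up the time slice first
    std::this_thread::sleep_for(delay);
    if (delay < kMaxDelay) {
      delay *= 2;  // back off to bound CPU usage on long transfers
    }
  }
  return s;
}
```

This keeps latency low for short transfers (first polls are microseconds apart) while capping CPU burn on long ones; it is still CPU-side waiting, though, so it does not address CUDA-graph capturability.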
Metadata exchange scalability

The exchangeMetadata() function performs O(world_size) iterations to fetch metadata from all peers
and uses a barrier. This may not scale well to large distributed configurations.
Consider if there's a more efficient approach for metadata exchange.

void NixlBackend::Impl::exchangeMetadata() {
  nixl_blob_t local_md;
  nixl_status_t md_status = agent_->getLocalMD(local_md);
  NVF_ERROR(
      md_status == NIXL_SUCCESS,
      "NIXL getLocalMD failed with status ",
      static_cast<int>(md_status));

  auto* store = communicator_.getTcpStore();
  const auto my_rank = communicator_.deviceId();
  const auto world_size = communicator_.size();

  std::string md_key_prefix = "nixl_agent_md_rank_";
  store->set(
      md_key_prefix + std::to_string(my_rank),
      std::vector<uint8_t>(local_md.begin(), local_md.end()));

  for (int64_t rank = 0; rank < world_size; ++rank) {
    if (rank == my_rank) {
      continue;
    }
    // Fetch & load MD
    auto bytes = store->get(md_key_prefix + std::to_string(rank));
    nixl_blob_t remote_md(bytes.begin(), bytes.end());
    std::string remote_agent_name;
    nixl_status_t status = agent_->loadRemoteMD(remote_md, remote_agent_name);
    NVF_ERROR(
        status == NIXL_SUCCESS,
        "NIXL loadRemoteMD failed for rank ",
        rank,
        " with status ",
        static_cast<int>(status));
  }

  // Barrier before deleting keys so no rank reads a deleted key.
  communicator_.barrier();

  store->deleteKey(md_key_prefix + std::to_string(my_rank));
  metadata_exchanged_ = true;
}
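One hedged alternative (not something the PR implements): load a peer's metadata lazily, on first communication with that peer, so the startup cost scales with the number of peers actually contacted rather than with world_size. A minimal sketch, where `LazyMetadataCache` and `fetch` are hypothetical names standing in for the `store->get` plus `agent_->loadRemoteMD` pair:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Sketch: cache remote agent names, fetching each peer's metadata at most
// once, and only when that peer is first used in a transfer.
class LazyMetadataCache {
 public:
  explicit LazyMetadataCache(std::function<std::string(int64_t)> fetch)
      : fetch_(std::move(fetch)) {}

  // Returns the remote agent name for `rank`, fetching and loading its
  // metadata on first use only.
  const std::string& agentNameFor(int64_t rank) {
    auto it = loaded_.find(rank);
    if (it == loaded_.end()) {
      it = loaded_.emplace(rank, fetch_(rank)).first;  // fetch once per peer
    }
    return it->second;
  }

 private:
  std::function<std::string(int64_t)> fetch_;
  std::unordered_map<int64_t, std::string> loaded_;
};
```

A design like this also sidesteps the key-deletion barrier for peers that are never contacted, at the cost of moving the fetch latency onto the first transfer to each peer.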
Probe mechanism validation

The UCX CUDA support probe (lines 168-210) is a good defensive addition. However, verify that
the probe correctly handles all edge cases where UCX might claim VRAM support but actually
misclassify memory. Consider adding logging when the probe fails.

// Probe: verify that VRAM (CUDA GPU memory) is actually usable with
// the UCX backend. Some UCX installations lack CUDA support, causing
// registerMem to silently misclassify VRAM as host memory. We detect
// this by registering a small buffer and asking NIXL to prepare a
// local descriptor list for VRAM -- if no backend claims VRAM, the
// probe fails and we mark the backend as unavailable.
{
  constexpr int64_t kProbeBytes = 1;
  auto probe = at::empty(
      {kProbeBytes},
      at::TensorOptions().dtype(at::kByte).device(
          at::kCUDA, communicator.deviceId()));
  size_t nbytes = static_cast<size_t>(probe.nbytes());
  uintptr_t addr = reinterpret_cast<uintptr_t>(probe.data_ptr());
  uint32_t dev_idx = static_cast<uint32_t>(probe.device().index());

  NVF_ERROR(nbytes > 0, "NIXL probe: unexpected zero-byte tensor");
  NVF_ERROR(addr != 0, "NIXL probe: null data pointer");

  nixl_reg_dlist_t reg_dlist(VRAM_SEG);
  reg_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixl_status_t reg_status = impl->agent_->registerMem(reg_dlist);
  if (reg_status != NIXL_SUCCESS) {
    return nullptr;
  }

  nixl_xfer_dlist_t xfer_dlist(VRAM_SEG);
  xfer_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixlDlistH* dlist_handle = nullptr;
  nixl_status_t prep_status =
      impl->agent_->prepXferDlist(NIXL_INIT_AGENT, xfer_dlist, dlist_handle);

  if (dlist_handle) {
    impl->agent_->releasedDlistH(dlist_handle);
  }
  impl->agent_->deregisterMem(reg_dlist);

  if (prep_status != NIXL_SUCCESS) {
    return nullptr;
  }
}
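Per the logging suggestion above, each probe step could report its failure before the backend silently returns nullptr. A minimal sketch, with `probeStep` as a hypothetical helper (the status codes and message text are illustrative, not NIXL's API):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Sketch: wrap a probe step so a failure is logged with its status code
// before the backend is marked unavailable, instead of failing silently.
bool probeStep(int status, const std::string& step, std::ostream& log) {
  constexpr int kSuccess = 0;  // stands in for NIXL_SUCCESS
  if (status != kSuccess) {
    log << "NIXL VRAM probe failed at '" << step << "' with status " << status
        << "; NIXL backend will be reported unavailable\n";
    return false;
  }
  return true;
}
```

In the probe above, the two `return nullptr` paths would become something like `if (!probeStep(reg_status, "registerMem", std::cerr)) return nullptr;`, making CUDA-less UCX installations diagnosable from the log.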

Collaborator

@samnordmann left a comment

Thank you very much! This looks great.
Here are some comments requesting minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.

#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
Collaborator

This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make this non-blocking for the CPU, and ideally CUDA-graph capturable?

Collaborator Author

Definitely important, I suggest we leave it for another PR, to keep this one simple and not too big

@samnordmann
Collaborator

@x41lakazam Can you provide instructions on how to build NIXL, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package.

@samnordmann
Collaborator

samnordmann commented Feb 26, 2026

unless it is already shipped in some DLFW package

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029

@samnordmann
Collaborator

Note the build error in the CI

  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:144:17: error: private field 'communicator_' is not used [-Werror,-Wunused-private-field]
    144 |   Communicator& communicator_;
        |                 ^
  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:146:8: error: private field 'metadata_exchanged_' is not used [-Werror,-Wunused-private-field]
    146 |   bool metadata_exchanged_ = false;
        |        ^

@x41lakazam x41lakazam marked this pull request as ready for review March 2, 2026 16:23
@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR introduces a new NIXL (NVIDIA Inference Xfer Library) communication backend for nvFuser's multi-device infrastructure. It adds a singleton NixlBackend wrapping a nixlAgent with UCX transport, enabling one-sided RDMA GPU-to-GPU tensor transfers via registerTensors / prepareTransfer / postTransfer / waitTransfer. Metadata is exchanged across ranks through the existing TCPStore. The feature is fully optional: it is gated by NVFUSER_BUILD_WITH_NIXL at build time and by a VRAM probe at runtime, leaving existing builds unaffected.

Key additions:

  • csrc/multidevice/nixl.h / nixl.cpp — core backend (singleton, registration, transfer lifecycle)
  • cmake/deps/handle_nixl.cmake — optional CMake detection with CUDA version constraint check
  • tools/install-nixl.sh — CI helper to install NIXL in pip or source mode
  • python/tools/prereqs/requirements/nixl.py — dependency reporter integration
  • tests/cpp/test_multidevice_nixl.cpp — unit + end-to-end ring-transfer tests

Issues found:

  • The NixlTransferHandleImpl destructor calls agent->releaseXferReq unconditionally even when posted == true, which can release an in-flight UCX request and lead to undefined behavior if a handle is destroyed while a transfer is still in progress.
  • tools/install-nixl.sh hardcodes nixl-cu12 in the pip installation path, which will silently install the wrong CUDA-versioned wheel on CUDA 11 or CUDA 13+ systems.
  • The CUDA major-version constraint check in cmake/deps/handle_nixl.cmake uses the internal nixl._pkg.__name__ attribute; package names with minor-version suffixes (e.g. nixl_cu12_5) will produce false mismatch warnings.

Confidence Score: 3/5

  • The feature itself is well-structured and opt-in, but several correctness concerns remain unresolved — particularly around in-flight handle destruction and the pip install script, which could cause silent runtime failures in non-CUDA-12 CI environments.
  • The overall architecture and API design are solid and the feature is off by default (no risk to existing builds). However, the in-flight transfer handle destruction issue is a real correctness hazard, the hardcoded nixl-cu12 in the install script is a practical CI breakage risk for non-CUDA-12 environments, and several distributed edge cases (deadlock on error, posted flag not reset) from prior review rounds are still unresolved. The score reflects good foundational work with a few issues needing attention before this is production-ready.
  • csrc/multidevice/nixl.cpp (in-flight handle release, error-path posted flag), tools/install-nixl.sh (hardcoded CUDA version), cmake/deps/handle_nixl.cmake (fragile version detection)

Important Files Changed

  • csrc/multidevice/nixl.cpp: Core NIXL backend implementation: adds agent lifecycle, memory registration, metadata exchange via TCPStore, and transfer (post/wait). Several correctness concerns raised in prior threads (deadlock in exchangeMetadata, posted flag not reset on error, in-flight handle release) are acknowledged but not fully resolved.
  • csrc/multidevice/nixl.h: Public API header defining NixlTransferHandle, NixlBackend, TensorDesc helpers, and serialization utilities. The structure is clean with good documentation; minor concern about non-trivial inline serialization helpers in the header.
  • cmake/deps/handle_nixl.cmake: New CMake handler for the NIXL dependency. The CUDA major-version constraint check relies on the internal nixl Python package naming (nixl._pkg.__name__), which is fragile and can report false mismatches with minor-version suffixes.
  • tools/install-nixl.sh: NIXL install script supporting pip and from-source modes. The pip mode hardcodes nixl-cu12, which will silently install the wrong CUDA-versioned wheel on non-CUDA-12 systems.
  • tests/cpp/test_multidevice_nixl.cpp: Comprehensive unit and end-to-end tests for the NIXL backend covering input validation, singleton access, and ring-topology read/write transfers. Test design is sound with appropriate GTEST_SKIP guards.

Sequence Diagram

sequenceDiagram
    participant App
    participant NixlBackend
    participant nixlAgent
    participant TCPStore
    participant PeerAgent

    App->>NixlBackend: getInstance()
    NixlBackend->>nixlAgent: create("rank_N", cfg)
    nixlAgent->>nixlAgent: createBackend("UCX")
    nixlAgent->>nixlAgent: VRAM probe (registerMem + prepXferDlist)
    NixlBackend-->>App: NixlBackend& (or isAvailable=false)

    App->>NixlBackend: registerTensors({t1, t2})
    NixlBackend->>nixlAgent: registerMem(dlist)
    NixlBackend->>nixlAgent: getLocalMD(local_md)
    NixlBackend->>TCPStore: set("nixl_agent_md_rank_N", local_md)
    NixlBackend->>TCPStore: get("nixl_agent_md_rank_Peer")
    TCPStore-->>NixlBackend: remote_md (blocks until peer sets key)
    NixlBackend->>nixlAgent: loadRemoteMD(remote_md)
    NixlBackend->>NixlBackend: barrier() [all ranks sync]
    NixlBackend->>TCPStore: deleteKey("nixl_agent_md_rank_N")

    App->>App: Exchange TensorDescs via TCPStore

    App->>NixlBackend: prepareTransfer(local_descs, remote_descs, op)
    NixlBackend->>nixlAgent: createXferReq(op, local_dlist, remote_dlist, peer_agent)
    NixlBackend-->>App: NixlTransferHandle

    App->>NixlBackend: postTransfer(handle)
    NixlBackend->>nixlAgent: postXferReq(xfer_handle)

    App->>NixlBackend: waitTransfer(handle)
    loop poll until NIXL_SUCCESS
        NixlBackend->>nixlAgent: getXferStatus(xfer_handle)
        nixlAgent-->>NixlBackend: NIXL_IN_PROG / NIXL_SUCCESS
        NixlBackend->>NixlBackend: this_thread::yield()
    end
    NixlBackend-->>App: done (posted=false)

    App->>NixlBackend: deregisterTensors({t1, t2})
    NixlBackend->>nixlAgent: deregisterMem(dlist)
    NixlBackend->>NixlBackend: exchangeMetadata() [collective]

Comments Outside Diff (1)

  1. csrc/multidevice/nixl.cpp, lines 32-40

    Destructor releases transfer handle that may still be in-flight

    ~NixlTransferHandleImpl() calls agent->releaseXferReq(xfer_handle) whenever xfer_handle is non-null, regardless of the current posted state. If a NixlTransferHandle goes out of scope after postTransfer() but before waitTransfer() (e.g. the caller catches an exception, or an EXPECT_THROW in a test discards the handle), the destructor will call releaseXferReq on an active request. The NIXL/UCX semantics for releasing an in-flight request are undefined, and depending on the UCX transport this can corrupt the remote buffer or crash.

    Consider waiting for (or cancelling) the request before releasing it:

    ~NixlTransferHandleImpl() {
      if (xfer_handle) {
        if (posted) {
          // Best-effort wait; ignore errors — we are in a destructor.
          nixl_status_t s;
          do { s = agent->getXferStatus(xfer_handle); }
          while (s == NIXL_IN_PROG);
        }
        agent->releaseXferReq(xfer_handle);
      }
    }

Last reviewed commit: 5eca79f

Contributor

@greptile-apps bot left a comment

9 files reviewed, 1 comment


Collaborator

@mdavis36 left a comment

Thank you for addressing the CMake changes so quickly; just a few more comments, but overall this looks good. When you get a chance, could you add screenshots/examples of the pretty report output for a few NIXL cases, e.g. success, wrong CUDA version, and library not found at all?

@x41lakazam
Collaborator Author

!test

@samnordmann
Collaborator

!test

@xwang233 could you please add permission for @x41lakazam to launch CI? Or indicate how to do so?
Thanks!

@samnordmann
Collaborator

!test

@samnordmann
Collaborator

!test

@x41lakazam
Collaborator Author

!build

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test



3 participants