
Add NIXL backend #6016

Open
x41lakazam wants to merge 42 commits into main from dispatch_combine/nixl_backend

Conversation

@x41lakazam
Collaborator

No description provided.

@github-actions

github-actions bot commented Feb 26, 2026

Review updated until commit 7283aa8

Description

  • Add NIXL backend for GPU-to-GPU RDMA transfers in multi-device communication

  • Implement tensor registration, metadata exchange, and transfer preparation/post/wait APIs

  • Add NIXL build option (NVFUSER_BUILD_WITH_NIXL) to CMake and Python build system

  • Include comprehensive tests for transfer handles, validation, and end-to-end transfers

Changes walkthrough

Relevant files

Enhancement (5 files)

  • nixl.h: Define NixlBackend and NixlTransferHandle classes (+221/-0)
  • nixl.cpp: Implement NIXL backend with UCX for GPU transfers (+474/-0)
  • multidevice.h: Add kNixl to CommunicatorBackend enum (+1/-1)
  • communicator.h: Add nixl_available_ flag and backend check (+7/-1)
  • communicator.cpp: Initialize nixl_available_ and add NIXL case to output (+9/-1)

Configuration changes (2 files)

  • CMakeLists.txt: Add NVFUSER_STANDALONE_BUILD_WITH_NIXL option and configuration (+39/-0)
  • utils.py: Add build_with_nixl config and cmake flag (+4/-0)

Documentation (1 file)

  • setup.py: Document NVFUSER_BUILD_WITH_NIXL build option (+3/-0)

Tests (1 file)

  • test_multidevice_nixl.cpp: Add tests for NIXL backend functionality (+289/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Spin loop performance concern

The waitTransfer() function at lines 385-399 uses a busy-wait spin loop to poll transfer status.
This can cause high CPU usage and is suboptimal for latency-sensitive workloads.
Consider whether it should yield to other threads or use a more efficient waiting mechanism.

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
  NVF_ERROR(handle.isValid(), "Cannot wait on an invalid handle");
  NVF_ERROR(handle.impl_->posted, "Transfer has not been posted yet");

  // TODO - check this spin loop
  NixlXferStatus xfer_status;
  do {
    xfer_status = getTransferStatus(handle);
    NVF_ERROR(
        xfer_status != NixlXferStatus::kError,
        "NIXL transfer completed with an error");
  } while (xfer_status == NixlXferStatus::kInProgress);

  handle.impl_->posted = false;
}
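One possible direction, sketched here purely as an illustration (the names `waitWithBackoff`, `XferStatus`, and `poll` are assumptions, not part of the PR): poll with `std::this_thread::yield()` plus a capped exponential backoff, so a long-running transfer does not pin a core at 100%.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical status enum mirroring NixlXferStatus.
enum class XferStatus { kInProgress, kDone, kError };

// Sketch: poll with yield + capped exponential backoff instead of a hot
// spin loop. `poll` stands in for getTransferStatus(handle).
template <typename PollFn>
XferStatus waitWithBackoff(PollFn poll) {
  auto delay = std::chrono::microseconds(1);
  constexpr auto kMaxDelay = std::chrono::microseconds(256);
  XferStatus s;
  while ((s = poll()) == XferStatus::kInProgress) {
    std::this_thread::yield();         // give up the time slice first
    std::this_thread::sleep_for(delay);
    if (delay < kMaxDelay) {
      delay *= 2;  // back off to bound CPU usage on long transfers
    }
  }
  return s;
}
```

This keeps latency low for short transfers (first polls are microseconds apart) while capping CPU burn on long ones; it is still CPU-side waiting, though, so it does not address CUDA-graph capturability.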
Metadata exchange scalability

The exchangeMetadata() function performs O(world_size) iterations to fetch metadata from all peers
and uses a barrier. This may not scale well to large distributed configurations.
Consider if there's a more efficient approach for metadata exchange.

void NixlBackend::Impl::exchangeMetadata() {
  nixl_blob_t local_md;
  nixl_status_t md_status = agent_->getLocalMD(local_md);
  NVF_ERROR(
      md_status == NIXL_SUCCESS,
      "NIXL getLocalMD failed with status ",
      static_cast<int>(md_status));

  auto* store = communicator_.getTcpStore();
  const auto my_rank = communicator_.deviceId();
  const auto world_size = communicator_.size();

  std::string md_key_prefix = "nixl_agent_md_rank_";
  store->set(
      md_key_prefix + std::to_string(my_rank),
      std::vector<uint8_t>(local_md.begin(), local_md.end()));

  for (int64_t rank = 0; rank < world_size; ++rank) {
    if (rank == my_rank) {
      continue;
    }
    // Fetch & load MD
    auto bytes = store->get(md_key_prefix + std::to_string(rank));
    nixl_blob_t remote_md(bytes.begin(), bytes.end());
    std::string remote_agent_name;
    nixl_status_t status = agent_->loadRemoteMD(remote_md, remote_agent_name);
    NVF_ERROR(
        status == NIXL_SUCCESS,
        "NIXL loadRemoteMD failed for rank ",
        rank,
        " with status ",
        static_cast<int>(status));
  }

  // Barrier before deleting keys so no rank reads a deleted key.
  communicator_.barrier();

  store->deleteKey(md_key_prefix + std::to_string(my_rank));
  metadata_exchanged_ = true;
}
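One hedged alternative (not something the PR implements): load a peer's metadata lazily, on first communication with that peer, so the startup cost scales with the number of peers actually contacted rather than with world_size. A minimal sketch, where `LazyMetadataCache` and `fetch` are hypothetical names standing in for the `store->get` plus `agent_->loadRemoteMD` pair:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Sketch: cache remote agent names, fetching each peer's metadata at most
// once, and only when that peer is first used in a transfer.
class LazyMetadataCache {
 public:
  explicit LazyMetadataCache(std::function<std::string(int64_t)> fetch)
      : fetch_(std::move(fetch)) {}

  // Returns the remote agent name for `rank`, fetching and loading its
  // metadata on first use only.
  const std::string& agentNameFor(int64_t rank) {
    auto it = loaded_.find(rank);
    if (it == loaded_.end()) {
      it = loaded_.emplace(rank, fetch_(rank)).first;  // fetch once per peer
    }
    return it->second;
  }

 private:
  std::function<std::string(int64_t)> fetch_;
  std::unordered_map<int64_t, std::string> loaded_;
};
```

A design like this also sidesteps the key-deletion barrier for peers that are never contacted, at the cost of moving the fetch latency onto the first transfer to each peer.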
Probe mechanism validation

The UCX CUDA support probe (lines 168-210) is a good defensive addition. However, verify that
the probe correctly handles all edge cases where UCX might claim VRAM support but actually
misclassify memory. Consider adding logging when the probe fails.

// Probe: verify that VRAM (CUDA GPU memory) is actually usable with
// the UCX backend. Some UCX installations lack CUDA support, causing
// registerMem to silently misclassify VRAM as host memory. We detect
// this by registering a small buffer and asking NIXL to prepare a
// local descriptor list for VRAM -- if no backend claims VRAM, the
// probe fails and we mark the backend as unavailable.
{
  constexpr int64_t kProbeBytes = 1;
  auto probe = at::empty(
      {kProbeBytes},
      at::TensorOptions().dtype(at::kByte).device(
          at::kCUDA, communicator.deviceId()));
  size_t nbytes = static_cast<size_t>(probe.nbytes());
  uintptr_t addr = reinterpret_cast<uintptr_t>(probe.data_ptr());
  uint32_t dev_idx = static_cast<uint32_t>(probe.device().index());

  NVF_ERROR(nbytes > 0, "NIXL probe: unexpected zero-byte tensor");
  NVF_ERROR(addr != 0, "NIXL probe: null data pointer");

  nixl_reg_dlist_t reg_dlist(VRAM_SEG);
  reg_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixl_status_t reg_status = impl->agent_->registerMem(reg_dlist);
  if (reg_status != NIXL_SUCCESS) {
    return nullptr;
  }

  nixl_xfer_dlist_t xfer_dlist(VRAM_SEG);
  xfer_dlist.addDesc({addr, nbytes, static_cast<uint64_t>(dev_idx)});

  nixlDlistH* dlist_handle = nullptr;
  nixl_status_t prep_status =
      impl->agent_->prepXferDlist(NIXL_INIT_AGENT, xfer_dlist, dlist_handle);

  if (dlist_handle) {
    impl->agent_->releasedDlistH(dlist_handle);
  }
  impl->agent_->deregisterMem(reg_dlist);

  if (prep_status != NIXL_SUCCESS) {
    return nullptr;
  }
}
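Per the logging suggestion above, each probe step could report its failure before the backend silently returns nullptr. A minimal sketch, with `probeStep` as a hypothetical helper (the status codes and message text are illustrative, not NIXL's API):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Sketch: wrap a probe step so a failure is logged with its status code
// before the backend is marked unavailable, instead of failing silently.
bool probeStep(int status, const std::string& step, std::ostream& log) {
  constexpr int kSuccess = 0;  // stands in for NIXL_SUCCESS
  if (status != kSuccess) {
    log << "NIXL VRAM probe failed at '" << step << "' with status " << status
        << "; NIXL backend will be reported unavailable\n";
    return false;
  }
  return true;
}
```

In the probe above, the two `return nullptr` paths would become something like `if (!probeStep(reg_status, "registerMem", std::cerr)) return nullptr;`, making CUDA-less UCX installations diagnosable from the log.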

Collaborator

@samnordmann left a comment

Thank you very much! This looks great.
Here are some comments requesting minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.

#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
Collaborator

This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make this non-blocking for the CPU, and ideally CUDA-graph capturable?

Collaborator Author

Definitely important, I suggest we leave it for another PR, to keep this one simple and not too big

@samnordmann
Collaborator

@x41lakazam Can you provide instructions on how to build NIXL, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package.

@samnordmann
Collaborator

samnordmann commented Feb 26, 2026

unless it is already shipped in some DLFW package

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029

@samnordmann
Collaborator

Note the build error in the CI

  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:144:17: error: private field 'communicator_' is not used [-Werror,-Wunused-private-field]
    144 |   Communicator& communicator_;
        |                 ^
  /home/runner/work/Fuser/Fuser/csrc/multidevice/nixl.cpp:146:8: error: private field 'metadata_exchanged_' is not used [-Werror,-Wunused-private-field]
    146 |   bool metadata_exchanged_ = false;
        |        ^

@x41lakazam x41lakazam marked this pull request as ready for review March 2, 2026 16:23
@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR introduces a new NIXL (NVIDIA Inference Xfer Library) communication backend for nvFuser's multi-device infrastructure. It adds a singleton NixlBackend wrapping a nixlAgent with UCX transport, enabling one-sided RDMA GPU-to-GPU tensor transfers via registerTensors / prepareTransfer / postTransfer / waitTransfer. Metadata is exchanged across ranks through the existing TCPStore. The feature is fully optional: it is gated by NVFUSER_BUILD_WITH_NIXL at build time and by a VRAM probe at runtime, leaving existing builds unaffected.

Key additions:

  • csrc/multidevice/nixl.h / nixl.cpp — core backend (singleton, registration, transfer lifecycle)
  • cmake/deps/handle_nixl.cmake — optional CMake detection with CUDA version constraint check
  • tools/install-nixl.sh — CI helper to install NIXL in pip or source mode
  • python/tools/prereqs/requirements/nixl.py — dependency reporter integration
  • tests/cpp/test_multidevice_nixl.cpp — unit + end-to-end ring-transfer tests

Issues found:

  • The NixlTransferHandleImpl destructor calls agent->releaseXferReq unconditionally even when posted == true, which can release an in-flight UCX request and lead to undefined behavior if a handle is destroyed while a transfer is still in progress.
  • tools/install-nixl.sh hardcodes nixl-cu12 in the pip installation path, which will silently install the wrong CUDA-versioned wheel on CUDA 11 or CUDA 13+ systems.
  • The CUDA major-version constraint check in cmake/deps/handle_nixl.cmake uses the internal nixl._pkg.__name__ attribute; package names with minor-version suffixes (e.g. nixl_cu12_5) will produce false mismatch warnings.

Confidence Score: 3/5

  • The feature itself is well-structured and opt-in, but several correctness concerns remain unresolved — particularly around in-flight handle destruction and the pip install script, which could cause silent runtime failures in non-CUDA-12 CI environments.
  • The overall architecture and API design are solid and the feature is off by default (no risk to existing builds). However, the in-flight transfer handle destruction issue is a real correctness hazard, the hardcoded nixl-cu12 in the install script is a practical CI breakage risk for non-CUDA-12 environments, and several distributed edge cases (deadlock on error, posted flag not reset) from prior review rounds are still unresolved. The score reflects good foundational work with a few issues needing attention before this is production-ready.
  • csrc/multidevice/nixl.cpp (in-flight handle release, error-path posted flag), tools/install-nixl.sh (hardcoded CUDA version), cmake/deps/handle_nixl.cmake (fragile version detection)

Important Files Changed

  • csrc/multidevice/nixl.cpp: Core NIXL backend implementation: adds agent lifecycle, memory registration, metadata exchange via TCPStore, and transfer (post/wait). Several correctness concerns raised in prior threads (deadlock in exchangeMetadata, posted flag not reset on error, in-flight handle release) are acknowledged but not fully resolved.
  • csrc/multidevice/nixl.h: Public API header defining NixlTransferHandle, NixlBackend, TensorDesc helpers, and serialization utilities. The structure is clean with good documentation; minor concern about non-trivial inline serialization helpers in the header.
  • cmake/deps/handle_nixl.cmake: New CMake handler for the NIXL dependency. The CUDA major-version constraint check relies on the internal nixl Python package naming (nixl._pkg.__name__), which is fragile and can report false mismatches with minor-version suffixes.
  • tools/install-nixl.sh: NIXL install script supporting pip and from-source modes. The pip mode hardcodes nixl-cu12, which will silently install the wrong CUDA-versioned wheel on non-CUDA-12 systems.
  • tests/cpp/test_multidevice_nixl.cpp: Comprehensive unit and end-to-end tests for the NIXL backend covering input validation, singleton access, and ring-topology read/write transfers. Test design is sound with appropriate GTEST_SKIP guards.

Sequence Diagram

sequenceDiagram
    participant App
    participant NixlBackend
    participant nixlAgent
    participant TCPStore
    participant PeerAgent

    App->>NixlBackend: getInstance()
    NixlBackend->>nixlAgent: create("rank_N", cfg)
    nixlAgent->>nixlAgent: createBackend("UCX")
    nixlAgent->>nixlAgent: VRAM probe (registerMem + prepXferDlist)
    NixlBackend-->>App: NixlBackend& (or isAvailable=false)

    App->>NixlBackend: registerTensors({t1, t2})
    NixlBackend->>nixlAgent: registerMem(dlist)
    NixlBackend->>nixlAgent: getLocalMD(local_md)
    NixlBackend->>TCPStore: set("nixl_agent_md_rank_N", local_md)
    NixlBackend->>TCPStore: get("nixl_agent_md_rank_Peer")
    TCPStore-->>NixlBackend: remote_md (blocks until peer sets key)
    NixlBackend->>nixlAgent: loadRemoteMD(remote_md)
    NixlBackend->>NixlBackend: barrier() [all ranks sync]
    NixlBackend->>TCPStore: deleteKey("nixl_agent_md_rank_N")

    App->>App: Exchange TensorDescs via TCPStore

    App->>NixlBackend: prepareTransfer(local_descs, remote_descs, op)
    NixlBackend->>nixlAgent: createXferReq(op, local_dlist, remote_dlist, peer_agent)
    NixlBackend-->>App: NixlTransferHandle

    App->>NixlBackend: postTransfer(handle)
    NixlBackend->>nixlAgent: postXferReq(xfer_handle)

    App->>NixlBackend: waitTransfer(handle)
    loop poll until NIXL_SUCCESS
        NixlBackend->>nixlAgent: getXferStatus(xfer_handle)
        nixlAgent-->>NixlBackend: NIXL_IN_PROG / NIXL_SUCCESS
        NixlBackend->>NixlBackend: this_thread::yield()
    end
    NixlBackend-->>App: done (posted=false)

    App->>NixlBackend: deregisterTensors({t1, t2})
    NixlBackend->>nixlAgent: deregisterMem(dlist)
    NixlBackend->>NixlBackend: exchangeMetadata() [collective]

Comments Outside Diff (1)

  1. csrc/multidevice/nixl.cpp, lines 32-40

    Destructor releases transfer handle that may still be in-flight

    ~NixlTransferHandleImpl() calls agent->releaseXferReq(xfer_handle) whenever xfer_handle is non-null, regardless of the current posted state. If a NixlTransferHandle goes out of scope after postTransfer() but before waitTransfer() (e.g. the caller catches an exception, or an EXPECT_THROW in a test discards the handle), the destructor will call releaseXferReq on an active request. The NIXL/UCX semantics for releasing an in-flight request are undefined, and depending on the UCX transport this can corrupt the remote buffer or crash.

    Consider waiting for (or cancelling) the request before releasing it:

    ~NixlTransferHandleImpl() {
      if (xfer_handle) {
        if (posted) {
          // Best-effort wait; ignore errors — we are in a destructor.
          nixl_status_t s;
          do { s = agent->getXferStatus(xfer_handle); }
          while (s == NIXL_IN_PROG);
        }
        agent->releaseXferReq(xfer_handle);
      }
    }

Last reviewed commit: 5eca79f

Contributor

@greptile-apps bot left a comment

9 files reviewed, 1 comment


Collaborator

@mdavis36 left a comment

Thank you for addressing the CMake changes so quickly; just a few more comments, but overall this looks good. When you get a chance, could you add screenshots/examples of the pretty report output for a few NIXL cases, e.g. success, wrong CUDA version, and library not found at all?

@x41lakazam
Collaborator Author

!test

@samnordmann
Collaborator

!test

@xwang233 could you please add permission for @x41lakazam to launch CI? Or indicate how to do so?
Thanks!

@samnordmann
Collaborator

!test

@samnordmann
Collaborator

!test

@x41lakazam
Collaborator Author

!build

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test

@x41lakazam
Collaborator Author

!test



3 participants