Conversation

Review updated until commit 6996d05

Description

Relevant files:

| Category | Files |
|---|---|
| Enhancement | 6 files |
| Configuration changes | |
| Tests | 1 file |
| Miscellaneous | 1 file |
PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review:

Silent fallback to Native backend

`getSymmetricMemoryBackend()` silently falls back to Native instead of reporting an error. This could mask user configuration mistakes. Consider adding validation to warn or error on unknown backend arguments.
Greptile Summary

This PR introduces PyTorch's symmetric memory (torch.distributed._symmetric_memory) as an alternative backend for SymmetricTensor, alongside the existing native implementation.
Issues found:
Confidence Score: 2/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant SymmetricTensor
    participant ensurePyTorchSymmMemBackend
    participant Communicator
    participant c10d_symm_mem as c10d::symmetric_memory
    Caller->>SymmetricTensor: allocate(sizes, dtype, device)
    SymmetricTensor->>ensurePyTorchSymmMemBackend: ensurePyTorchSymmMemBackend(backend)
    ensurePyTorchSymmMemBackend->>c10d_symm_mem: set_backend("NCCL"|"NVSHMEM") [once]
    ensurePyTorchSymmMemBackend->>Communicator: getSymmMemGroupKey(kNccl)
    Communicator->>Communicator: getBackendForTeam(all_ranks, kNccl)
    Communicator-->>Communicator: register_process_group(team_key, pg)
    Communicator-->>ensurePyTorchSymmMemBackend: group_name
    ensurePyTorchSymmMemBackend->>Communicator: barrier(kNccl)
    ensurePyTorchSymmMemBackend-->>SymmetricTensor: group_name
    SymmetricTensor->>c10d_symm_mem: empty_strided_p2p(sizes, strides, dtype, device, alloc_group_name)
    c10d_symm_mem-->>SymmetricTensor: local_tensor
    SymmetricTensor-->>Caller: local_tensor
    Caller->>SymmetricTensor: setupRemoteHandles(tag)
    SymmetricTensor->>ensurePyTorchSymmMemBackend: ensurePyTorchSymmMemBackend(backend)
    ensurePyTorchSymmMemBackend-->>SymmetricTensor: group_name
    SymmetricTensor->>c10d_symm_mem: rendezvous(local_tensor, group_name)
    c10d_symm_mem-->>SymmetricTensor: torch_symm_handle_
    Note over SymmetricTensor: Sets are_remote_tensors_setup_=true<br/>Sets is_multicast_setup_ if supported
    Caller->>SymmetricTensor: remoteTensor(rank)
    SymmetricTensor->>c10d_symm_mem: torch_symm_handle_->get_remote_tensor(rank, ...)
    c10d_symm_mem-->>SymmetricTensor: remote at::Tensor
    SymmetricTensor-->>Caller: remote at::Tensor
```

Last reviewed commit: "all backends passing"
```bash
#!/bin/bash

export CC=clang-20
export CXX=clang++-20
export LDFLAGS="-fuse-ld=mold"

export NVFUSER_BUILD_ENABLE_PCH

export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"

# export TORCH_CUDA_ARCH_LIST="9.0"

export NVFUSER_BUILD_WITH_UCC=1
export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY

# Enable debug mode, leave empty for non-debug compilation
export NVFUSER_BUILD_BUILD_TYPE=Debug
export RUN_CMAKE=""

pip install -v -e ./python --no-build-isolation
```
Personal developer build script committed to repository
This script contains machine-specific, hardcoded toolchain paths that are unlikely to work anywhere except the author's development machine:

- `clang-20` and `clang++-20`: not a standard compiler version available broadly
- `-fuse-ld=mold`: requires the `mold` linker to be installed
- `/opt/hpcx/ucc` and `/opt/hpcx/ucx`: HPC-X installation paths specific to the author's environment
- `$BUILD_DIRECTORY` is used but never validated; if it is unset, `NVFUSER_BUILD_INSTALL_DIR` and `NVFUSER_BUILD_DIR` will silently be empty strings, likely breaking the build
This kind of personal convenience script should live outside version control (e.g., in a .gitignore-d directory or in the author's home directory). Committing it to the main repo risks confusing other contributors and cluttering the root directory.
```cpp
void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
  static std::once_flag once;
  std::call_once(once, [backend]() {
    const char* name = nullptr;
    switch (backend) {
      case SymmetricMemoryBackend::PyTorchNccl:
        name = "NCCL";
        break;
      case SymmetricMemoryBackend::PyTorchNvshmem:
        name = "NVSHMEM";
        break;
      case SymmetricMemoryBackend::PyTorchCuda:
        name = "CUDA";
        break;
      default:
        NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
    }
    c10d::symmetric_memory::set_backend(name);
    Communicator& comm = Communicator::getInstance();
    NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
    c10d::symmetric_memory::set_group_info(
        kPyTorchSymmMemGroupName,
        static_cast<int>(comm.deviceId()),
        static_cast<int>(comm.size()),
        comm.getStore());
  });
}
```
NCCL backend initialization is incomplete — register_process_group is never called
`ensurePyTorchSymmMemBackend` calls `set_group_info` but never calls `c10d::register_process_group`. According to the comment added to communicator.h for `getWorldBackendIntrusivePtr`:

> Returns the world backend as an intrusive_ptr so it can be registered with
> c10d::register_process_group (e.g. for PyTorch symmetric memory NCCL rendezvous, which resolves the group by name).

`getWorldBackendIntrusivePtr` was clearly introduced to supply the backend for this registration, yet the call to `c10d::register_process_group` is absent from `ensurePyTorchSymmMemBackend`. PyTorch's NCCL symmetric-memory rendezvous resolves the process group by name at the point it is called; without a prior `register_process_group(kPyTorchSymmMemGroupName, ...)`, the NCCL backend path will fail to locate the group and throw at rendezvous time.

The missing call should be something like:

```cpp
// After set_group_info, for the NCCL backend:
c10d::register_process_group(
    kPyTorchSymmMemGroupName,
    comm.getWorldBackendIntrusivePtr(CommunicatorBackend::kNccl));
```

The fact that `getWorldBackendIntrusivePtr` was added in this exact PR but is never invoked strongly suggests this step was accidentally left out.
```cpp
// TEST_F(SymmetricTensorTest, PyTorchBackend_RemoteAccessCorrectness) {
//   if (communicator_->size() == 1) {
//     GTEST_SKIP() << "Skipping test for single device";
//   }
//   SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
//   if (backend == SymmetricMemoryBackend::Native) {
//     GTEST_SKIP()
//         << "PyTorch backend not selected; set NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl) to run";
//   }
//
//   const int64_t rank = communicator_->deviceId();
//   const int64_t world_size = communicator_->size();
//
//   at::Tensor local_tensor = SymmetricTensor::allocate(
//       {256, 512}, at::ScalarType::Float, communicator_->device());
//   SymmetricTensor sym_tensor(local_tensor);
//
//   EXPECT_TRUE(local_tensor.is_cuda());
//   EXPECT_EQ(local_tensor.numel(), 256 * 512);
//
//   float local_value = static_cast<float>(rank + 200);
//   local_tensor.fill_(local_value);
//
//   sym_tensor.setupRemoteHandles();
//
//   for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
//     void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
//     EXPECT_NE(peer_ptr, nullptr);
//
//     float peer_value;
//     NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
//         &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));
//
//     float expected_value = static_cast<float>(peer_rank + 200);
//     EXPECT_FLOAT_EQ(peer_value, expected_value)
//         << "Rank " << rank << " reading from rank " << peer_rank
//         << " (PyTorch backend)";
//   }
// }
```
Entire PyTorch backend correctness test is commented out
PyTorchBackend_RemoteAccessCorrectness is the only test that exercises the new PyTorch backend path end-to-end (allocation → rendezvous → remote access). Leaving it commented out means the three new backend variants (pytorch_nccl, pytorch_nvshmem, pytorch_cuda) have zero test coverage in CI.
The comment says it should be run manually with NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl), but that means regressions in the PyTorch path will go undetected in normal CI runs.
If the test can't pass yet (e.g., because the NCCL register_process_group call is missing), that's a strong signal to fix the underlying issue rather than suppress the test. If the test is intentionally deferred, consider converting it into a proper GTEST_SKIP with an explanatory message so the intent is visible to reviewers and CI.
```cpp
std::vector<int64_t> strides(sizes.size());
strides.back() = 1;
for (int64_t i = (int64_t)strides.size() - 2; i >= 0; --i) {
```

Undefined behavior when sizes is empty (0-dim tensor)

```cpp
std::vector<int64_t> strides(sizes.size());
strides.back() = 1; // UB if sizes is empty
```

`std::vector::back()` on an empty vector is undefined behaviour. The same guard-free pattern also exists in the native path further down in the same function (~line 225). While allocating a 0-dimensional symmetric tensor is unusual, the PyTorch path that was just added introduces a new callsite where callers may pass `{}` as sizes. A simple check is sufficient:

```cpp
NVF_CHECK(!sizes.empty(), "Cannot allocate a 0-dim symmetric tensor");
```

or initialise strides defensively (matching the standard row-major convention for 0-dim tensors, which is an empty strides vector) and skip the loop entirely when sizes is empty.
Sorry! I accidentally hit the button to merge main into the branch. Hopefully it's ok.
```cpp
void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
  static std::once_flag once;
  std::call_once(once, [backend]() {
    const char* name = nullptr;
    switch (backend) {
      case SymmetricMemoryBackend::PyTorchNccl:
        name = "NCCL";
        break;
      case SymmetricMemoryBackend::PyTorchNvshmem:
        name = "NVSHMEM";
        break;
      case SymmetricMemoryBackend::PyTorchCuda:
        name = "CUDA";
        break;
      default:
        NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
    }
    c10d::symmetric_memory::set_backend(name);
    Communicator& comm = Communicator::getInstance();
    NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
    c10d::symmetric_memory::set_group_info(
        kPyTorchSymmMemGroupName,
        static_cast<int>(comm.deviceId()),
        static_cast<int>(comm.size()),
        comm.getStore());
  });
}
```
std::call_once exception-safety leaves set_backend in a permanently broken state on retry
`std::call_once` resets its `once_flag` if the callable exits via an exception, allowing a subsequent call to retry. However, the callable here calls `set_backend(name)` before `set_group_info(...)`. If `set_backend` succeeds but `set_group_info` subsequently throws (e.g., because the store is unavailable), the `once_flag` is reset and the next `allocate()` call will attempt `set_backend(name)` a second time. PyTorch's symmetric memory layer is likely to throw on that second `set_backend` call (backend already configured), making it impossible to recover without restarting the process.

A straightforward mitigation is to separate the two calls into distinct phases or to wrap `set_backend` in its own protection:

```cpp
// Separate once-flags for each idempotent step, or catch and suppress
// the "already set" error from set_backend on retry:
try {
  c10d::symmetric_memory::set_backend(name);
} catch (const std::exception& e) {
  // If the backend is already set to the correct name, treat as success.
  // Re-throw otherwise.
}
c10d::symmetric_memory::set_group_info(
    kPyTorchSymmMemGroupName, ...);
```

Alternatively, split the `once_flag` so `set_backend` has its own dedicated guard that truly runs at most once, while `set_group_info` can retry on failure.
```cpp
void* SymmetricTensor::multicastPtr() const {
#ifdef NVFUSER_DISTRIBUTED
  if (py_symm_handle_) {
    return py_symm_handle_->has_multicast_support()
        ? py_symm_handle_->get_multicast_ptr()
        : nullptr;
  }
#endif
```
multicastPtr() silently returns nullptr for PyTorch backend when multicast is not supported, which is inconsistent with the native path (which calls NVF_CHECK(is_multicast_setup_, "Multicast not setup")).
Any caller that does not check for nullptr before using the pointer will trigger a null pointer dereference / silent GPU fault rather than a clear diagnostic error.
Consider throwing or at least asserting instead of silently returning nullptr:
Current:

```cpp
void* SymmetricTensor::multicastPtr() const {
#ifdef NVFUSER_DISTRIBUTED
  if (py_symm_handle_) {
    return py_symm_handle_->has_multicast_support()
        ? py_symm_handle_->get_multicast_ptr()
        : nullptr;
  }
#endif
```

Suggested:

```cpp
void* SymmetricTensor::multicastPtr() const {
#ifdef NVFUSER_DISTRIBUTED
  if (py_symm_handle_) {
    NVF_CHECK(
        py_symm_handle_->has_multicast_support(),
        "Multicast not supported by the selected PyTorch symmetric memory backend.");
    return py_symm_handle_->get_multicast_ptr();
  }
#endif
  NVF_CHECK(is_multicast_setup_, "Multicast not setup");
  return mc_ptr_;
}
```
This brings the error contract in line with the native path, where multicastPtr() always either returns a valid pointer or throws.
```cpp
if (getSymmetricMemoryBackend() != SymmetricMemoryBackend::Native) {
  ensurePyTorchSymmMemBackend(getSymmetricMemoryBackend());
```

`getSymmetricMemoryBackend()` is invoked twice in back-to-back lines, which redundantly re-parses the option string on each call. A single local variable should be used:

```cpp
SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
if (backend != SymmetricMemoryBackend::Native) {
  ensurePyTorchSymmMemBackend(backend);
```
```cpp
TEST_F(SymmetricTensorTest, GetSymmetricMemoryBackend_ReturnsValidBackend) {
  SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
  EXPECT_TRUE(
      backend == SymmetricMemoryBackend::Native ||
      backend == SymmetricMemoryBackend::PyTorchNccl ||
      backend == SymmetricMemoryBackend::PyTorchNvshmem ||
      backend == SymmetricMemoryBackend::PyTorchCuda)
      << "getSymmetricMemoryBackend() returned an invalid backend value";
}
```
GetSymmetricMemoryBackend_ReturnsValidBackend test is trivially true. Every branch of getSymmetricMemoryBackend() explicitly returns one of the four enum values listed in the EXPECT_TRUE condition, so there is no code path that could return a fifth value. This test can never fail and provides no meaningful coverage.
If the intent is to document the valid values, a static assertion in ipc_utils.cpp would be more appropriate. If the intent is to test that the env-var parsing correctly maps strings to enum values, the test should set up specific NVFUSER_ENABLE strings and assert the exact expected enum variant (e.g., set pytorch_nccl and assert PyTorchNccl).
samnordmann left a comment:

Thank you! Some minor comments.

Please add a test, fix the linter, and run the CI with the !test command (comment directly on the PR).
Fuser/.github/workflows/lint.yml, line 83 (commit 5b210dd)
```cpp
// Symmetric memory backend and option tests
// -----------------------------------------------------------------------------

TEST_F(SymmetricTensorTest, GetSymmetricMemoryBackend_ReturnsValidBackend) {
```
```cpp
  }
}

// Same remote-access correctness as BasicAllocation but only runs when
```
This is the only test, but it is commented out. Either remove it or un-comment it. An idea would be to reuse the pre-existing tests and parametrize them with the new backends.
csrc/multidevice/symmetric_tensor.h (outdated)

```cpp
// - Native (default): Fuser's own CUDA VMM + IPC implementation; maintained.
// - PyTorch (Nccl, Nvshmem, Cuda): Use PyTorch's symmetric memory
//   (torch.distributed._symmetric_memory) with the chosen transport backend.
// Select via NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl|pytorch_nvshmem|pytorch_cuda).
```
The selection should also cover native and list it as an option.
csrc/multidevice/symmetric_tensor.h (outdated)

```cpp
// further fragment the memory. On the other hand, having our own implementation
// allows us to experiment more advanced features like contigous view creation.
// Backends (see SymmetricMemoryBackend in ipc_utils.h):
// - Native (default): Fuser's own CUDA VMM + IPC implementation; maintained.
```

Suggested:

```cpp
// - Native (default): Fuser's own CUDA VMM + IPC implementation.
```
csrc/multidevice/symmetric_tensor.h (outdated)

```cpp
// When set, remote/multicast APIs delegate to PyTorch symmetric memory.
c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory> py_symm_handle_;
```

The py_ prefix wrongly suggests Python. I am also not sure I understand the comment.

Suggested:

```cpp
c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory> symm_handle_;
```
```cpp
#ifdef NVFUSER_DISTRIBUTED
  // PyTorch backend: perform rendezvous here (lazy, on first setupRemoteHandles).
  if (getSymmetricMemoryBackend() != SymmetricMemoryBackend::Native) {
    ensurePyTorchSymmMemBackend(getSymmetricMemoryBackend());
```

This has already been called in the constructor.
```cpp
NVF_ERROR(
    false,
```

Suggested:

```cpp
NVF_THROW(
```
csrc/multidevice/communicator.h (outdated)

```cpp
  return store_.get();
}

#ifdef NVFUSER_DISTRIBUTED
```

Why do we need the guard here?
csrc/multidevice/communicator.h (outdated)

```cpp
#ifdef NVFUSER_DISTRIBUTED
#include <torch/csrc/distributed/c10d/Backend.hpp>
#include <torch/csrc/distributed/c10d/Store.hpp>
```
csrc/multidevice/communicator.h (outdated)

```cpp
// Returns the store as an intrusive_ptr for use with PyTorch symmetric
// memory (c10d::symmetric_memory::set_group_info).
c10::intrusive_ptr<c10d::Store> getStore() const;

// Returns the world backend as an intrusive_ptr so it can be registered with
// c10d::register_process_group (e.g. for PyTorch symmetric memory NCCL
// rendezvous, which resolves the group by name).
c10::intrusive_ptr<c10d::Backend> getWorldBackendIntrusivePtr(
    std::optional<CommunicatorBackend> backend = std::nullopt);
```

Rather, change the signature of the existing getter method to return intrusive_ptr instead of a raw pointer.
```cpp
std::string Communicator::getSymmMemGroupKey(
    std::optional<CommunicatorBackend> backend) {
  std::vector<RankType> all_ranks(size_);
  std::iota(all_ranks.begin(), all_ranks.end(), 0);
  CommunicatorBackend b = backend.value_or(default_backend_);
  (void)getBackendForTeam(all_ranks, b, "symm_mem_");
  return getTeamKey(all_ranks, b);
}
```
getSymmMemGroupKey returns key without "symm_mem_" prefix — mismatch with registered process group
getBackendForTeam(all_ranks, b, "symm_mem_") registers the process group under the key "symm_mem_" + getTeamKey(all_ranks, b) (see the register_process_group call in that function). However, getSymmMemGroupKey then returns just getTeamKey(all_ranks, b) — without the "symm_mem_" prefix.
The returned key is subsequently used in ensurePyTorchSymmMemBackend as the group_name passed to both set_group_info and rendezvous. Newer NCCL builds resolve the process group by name at rendezvous time; they will look for a process group registered as "nccl0,1,..." but only "symm_mem_nccl0,1,..." exists, causing rendezvous to fail.
The current workaround that registers under "0" papers over this for older NCCL, but the mismatch will surface as soon as the TODO comment is resolved and older-NCCL special-casing is removed.
The return statement should return the full team key, including the prefix:

Current:

```cpp
std::string Communicator::getSymmMemGroupKey(
    std::optional<CommunicatorBackend> backend) {
  std::vector<RankType> all_ranks(size_);
  std::iota(all_ranks.begin(), all_ranks.end(), 0);
  CommunicatorBackend b = backend.value_or(default_backend_);
  (void)getBackendForTeam(all_ranks, b, "symm_mem_");
  return getTeamKey(all_ranks, b);
}
```

Suggested:

```cpp
std::string Communicator::getSymmMemGroupKey(
    std::optional<CommunicatorBackend> backend) {
  std::vector<RankType> all_ranks(size_);
  std::iota(all_ranks.begin(), all_ranks.end(), 0);
  CommunicatorBackend b = backend.value_or(default_backend_);
  const std::string prefix = "symm_mem_";
  (void)getBackendForTeam(all_ranks, b, prefix);
  return prefix + getTeamKey(all_ranks, b);
}
```
csrc/multidevice/communicator.h (outdated)

```cpp
c10::intrusive_ptr<c10d::Store> getStore() const {
  return c10::intrusive_ptr<c10d::Store>(store_);
}
```

getStore() uses non-idiomatic intrusive_ptr construction

`c10::intrusive_ptr<c10d::Store>(store_)` passes the raw `TCPStore*` obtained from `store_` (via the implicit `operator T*` of `intrusive_ptr`) to a new `intrusive_ptr<Store>`. This calls the unsafe `intrusive_ptr<T>(T*, bool)` constructor that takes an already-retained raw pointer, but `store_` is managed and this path risks a ref-count imbalance.

The idiomatic way is to let the intrusive_ptr copy-conversion handle it:

```cpp
c10::intrusive_ptr<c10d::Store> getStore() const {
  return store_;
}
```
```cpp
if(is_multicast_setup_==false) {
  SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
  if (backend != SymmetricMemoryBackend::Native) {
    const std::string group_name = ensurePyTorchSymmMemBackend(backend);
    torch_symm_handle_ = c10d::symmetric_memory::rendezvous(
        local_tensor_, group_name);
    are_remote_tensors_setup_ = true;
    if (torch_symm_handle_->has_multicast_support()) {
      is_multicast_setup_ = true;
      mc_ptr_ = torch_symm_handle_->get_multicast_ptr();
    }
    return;
  }
}
```
if(is_multicast_setup_==false) guard is dead code for PyTorch backend
is_multicast_setup_ is never set to true before setupRemoteHandles is called on the PyTorch path: setupMulticast returns unconditionally at line ~615 when torch_symm_handle_ is set, so is_multicast_setup_ remains false. The outer guard is therefore always true and provides no real protection.
The effect is that the rendezvous code is unreachable if any caller were to set is_multicast_setup_ = true first (e.g., through a future code path). The intent—"skip rendezvous if multicast is already fully set up"—is actually achieved by the are_remote_tensors_setup_ early-return at the top of the function, not by this inner guard.
Consider removing this redundant outer condition to make the control flow clearer:
```cpp
#ifdef NVFUSER_DISTRIBUTED
  // PyTorch backend: perform rendezvous here (lazy, on first setupRemoteHandles).
  SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
  if (backend != SymmetricMemoryBackend::Native) {
    const std::string group_name = ensurePyTorchSymmMemBackend(backend);
    torch_symm_handle_ = c10d::symmetric_memory::rendezvous(
        local_tensor_, group_name);
    are_remote_tensors_setup_ = true;
    if (torch_symm_handle_->has_multicast_support()) {
      is_multicast_setup_ = true;
      mc_ptr_ = torch_symm_handle_->get_multicast_ptr();
    }
    return;
  }
#endif
```

```cpp
NVF_THROW(
    false,
    "Contiguous view is not yet supported for PyTorch symmetric memory backend. "
    "Use native backend for SymmetricContiguousView.");
}
```
NVF_THROW with false as first argument produces a garbled error message
NVF_THROW(...) is an unconditional throw whose variadic arguments are all concatenated into the error message via to_str(__VA_ARGS__). Passing false as the first argument does not act as a condition — it is serialised as part of the message (e.g. "0Contiguous view is not yet...") by to_str. This makes the resulting error message confusing and hard to read in diagnostics.
The same pattern is used again in getContiguousView (line 607–611).
Use NVF_THROW with only the message string, or use the established NVF_ERROR(false, "msg") pattern that is already used elsewhere in this file (e.g. line 74):
Current:

```cpp
NVF_THROW(
    false,
    "Contiguous view is not yet supported for PyTorch symmetric memory backend. "
    "Use native backend for SymmetricContiguousView.");
}
```

Suggested:

```cpp
NVF_THROW(
    "Contiguous view is not yet supported for PyTorch symmetric memory backend. "
    "Use native backend for SymmetricContiguousView.");
```
```cpp
case SymmetricMemoryBackend::PyTorchCuda:
  name = "CUDA";
  break;
```
set_backend is never called for the PyTorchCuda backend
For PyTorchNccl and PyTorchNvshmem, c10d::symmetric_memory::set_backend(name) is called inside the call_once lambda. For PyTorchCuda, name is assigned "CUDA" but set_backend is never invoked. If PyTorch's symmetric-memory layer requires an explicit set_backend call before allocating with a CUDA transport, every empty_strided_p2p call on the CUDA path will either use whatever backend was previously configured (potentially NCCL or NVSHMEM) or fail silently at rendezvous time.
If PyTorchCuda truly requires no set_backend call (e.g., because "CUDA" is the implicit default), please add a comment explaining this so future maintainers don't perceive it as an oversight. Otherwise, add the missing call:
```cpp
case SymmetricMemoryBackend::PyTorchCuda:
  name = "CUDA";
  c10d::symmetric_memory::set_backend(name);
  break;
```
No description provided.