Conversation

Review updated until commit 7283aa8

Description

| Category | Relevant files |
|---|---|
| Enhancement | 5 files |
| Configuration changes | |
| Documentation | 1 file |
| Tests | 1 file |
PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

**Spin loop performance concern**

The polling loop can cause high CPU usage and may not be optimal for latency-sensitive workloads. Consider whether it should yield to other threads or use a more efficient waiting mechanism.
**samnordmann** left a comment

Thank you very much! This looks great.

Here are some comments requesting minor changes or explanations. The only point I'm a bit worried about is that we need a way to make the "wait" non-blocking for the CPU.
```cpp
#endif
}

void NixlBackend::Impl::waitTransfer(NixlTransferHandle& handle) {
```
This wait function is CPU-blocking, so in practice it is more or less unusable in our context. Do you have an idea how to make this non-blocking for the CPU, and ideally CUDA-graph capturable?

Definitely important. I suggest we leave it for another PR, to keep this one simple and not too big.
@x41lakazam Can you provide instructions on how to build nixl, say, from the pjnl docker image? We'll probably need to think about how to add the library to the base image and/or the CI, unless it is already shipped in some DLFW package.

https://nvidia.slack.com/archives/C08KL9MNQ3U/p1771951941351029

Note the build error in the CI.
Greptile Summary

This PR introduces a new NIXL (NVIDIA Interconnect eXchange Layer) communication backend for nvFuser's multi-device infrastructure. It adds a singleton `NixlBackend`.

Key additions:
Issues found:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant App
    participant NixlBackend
    participant nixlAgent
    participant TCPStore
    participant PeerAgent
    App->>NixlBackend: getInstance()
    NixlBackend->>nixlAgent: create("rank_N", cfg)
    nixlAgent->>nixlAgent: createBackend("UCX")
    nixlAgent->>nixlAgent: VRAM probe (registerMem + prepXferDlist)
    NixlBackend-->>App: NixlBackend& (or isAvailable=false)
    App->>NixlBackend: registerTensors({t1, t2})
    NixlBackend->>nixlAgent: registerMem(dlist)
    NixlBackend->>nixlAgent: getLocalMD(local_md)
    NixlBackend->>TCPStore: set("nixl_agent_md_rank_N", local_md)
    NixlBackend->>TCPStore: get("nixl_agent_md_rank_Peer")
    TCPStore-->>NixlBackend: remote_md (blocks until peer sets key)
    NixlBackend->>nixlAgent: loadRemoteMD(remote_md)
    NixlBackend->>NixlBackend: barrier() [all ranks sync]
    NixlBackend->>TCPStore: deleteKey("nixl_agent_md_rank_N")
    App->>App: Exchange TensorDescs via TCPStore
    App->>NixlBackend: prepareTransfer(local_descs, remote_descs, op)
    NixlBackend->>nixlAgent: createXferReq(op, local_dlist, remote_dlist, peer_agent)
    NixlBackend-->>App: NixlTransferHandle
    App->>NixlBackend: postTransfer(handle)
    NixlBackend->>nixlAgent: postXferReq(xfer_handle)
    App->>NixlBackend: waitTransfer(handle)
    loop poll until NIXL_SUCCESS
        NixlBackend->>nixlAgent: getXferStatus(xfer_handle)
        nixlAgent-->>NixlBackend: NIXL_IN_PROG / NIXL_SUCCESS
        NixlBackend->>NixlBackend: this_thread::yield()
    end
    NixlBackend-->>App: done (posted=false)
    App->>NixlBackend: deregisterTensors({t1, t2})
    NixlBackend->>nixlAgent: deregisterMem(dlist)
    NixlBackend->>NixlBackend: exchangeMetadata() [collective]
```
**mdavis36** left a comment

Thank you for addressing the CMake changes so quickly; just a few more comments, but overall this looks good. When you get a chance, would you be able to add screenshots/examples of the pretty report output for a few NIXL cases, e.g. success, wrong CUDA version, and library not found at all?
!test

@xwang233 could you please add permission for @x41lakazam to launch CI? Or indicate how to do so?

!test

!test

!build

!test

!test

!test
No description provided.