graph: pass netId (not netDev) to ncclTopoGetIntermediateRank in PXN level 1#2258
Open
EylonKrause wants to merge 1 commit into
Open
graph: pass netId (not netDev) to ncclTopoGetIntermediateRank in PXN level 1#2258EylonKrause wants to merge 1 commit into
EylonKrause wants to merge 1 commit into
Conversation
…level 1 In the NCCL_P2P_PXN_LEVEL=1 branch of ncclTopoGetNetDev, the third arg passed was *dev (the raw net device index) where ncclTopoGetIntermediateRank expects an int64_t netId (used via ncclTopoIdToIndex). netId = (systemId<<56)|dev, so on a multi-system/MNNVL topology (systemId!=0) the lookup fails with ncclInternalError; single-system (systemId==0) masks it since netId==dev. The sibling call (line ~1417) and the PXN level 2 branch already pass netId. Also removes an unguarded *dev deref (dev may be NULL on this path). Signed-off-by: EylonKrause <eylon1909@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
In
ncclTopoGetNetDev, theNCCL_P2P_PXN_LEVEL=1branch calls:passing
*dev— the raw network device index just assigned fromnetDev. ButncclTopoGetIntermediateRank's third parameter is anint64_t netId, consumed as a topology id viancclTopoIdToIndex(system, NET, netId, ...). A NET node's id isnetId = NCCL_TOPO_ID(systemId, dev) = (systemId << 56) | dev, which differs from the baredevwheneversystemId != 0(a multi-system / MNNVL fused topology). In that case the lookup finds no matching NET node and returnsncclInternalError, which propagates out ofncclTopoGetNetDevand breaks proxy / intermediate-rank resolution during initialization. On a single-system topologysystemId == 0makesnetId == dev, which masks the bug. The sibling call earlier in the function and the entirePXN_LEVEL=2branch already passnetId.This also removes a latent NULL dereference: a few lines above,
*devis written underif (dev) *dev = netDev;, but thencclTopoGetIntermediateRankcall dereferenced*devunconditionally; callers may passdev == NULLon this path.Related Issues
None.
Changes & Impact
src/graph/search.cc(ncclTopoGetNetDev, PXN level 1): passnetIdinstead of*devtoncclTopoGetIntermediateRank, matching the sibling call site and the PXN level 2 branch. Single-system behavior is unchanged (netId == devthere); the fix corrects intermediate-rank resolution on multi-system / MNNVL topologies and removes the unguarded*devderef. Non-breaking.Performance Impact
None.
Testing
make src.build.netId) and theNCCL_TOPO_ID(systemId, dev)definition. The failure requires a multi-system / MNNVL topology withNCCL_P2P_PXN_LEVEL=1, which cannot be reproduced on a single-GPU machine; on single-system topologiessystemId == 0, so the change is behavior-preserving there.