Skip to content

graph: pass netId (not netDev) to ncclTopoGetIntermediateRank in PXN level 1#2258

Open
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/topo-getnetdev-netid
Open

graph: pass netId (not netDev) to ncclTopoGetIntermediateRank in PXN level 1#2258
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/topo-getnetdev-netid

Conversation

@EylonKrause

Copy link
Copy Markdown

Description

In ncclTopoGetNetDev, the NCCL_P2P_PXN_LEVEL=1 branch calls:

NCCLCHECK(ncclTopoGetIntermediateRank(comm->topo, rank, *dev, proxyRank));

passing *dev — the raw network device index just assigned from netDev. But ncclTopoGetIntermediateRank's third parameter is an int64_t netId, consumed as a topology id via ncclTopoIdToIndex(system, NET, netId, ...). A NET node's id is netId = NCCL_TOPO_ID(systemId, dev) = (systemId << 56) | dev, which differs from the bare dev whenever systemId != 0 (a multi-system / MNNVL fused topology). In that case the lookup finds no matching NET node and returns ncclInternalError, which propagates out of ncclTopoGetNetDev and breaks proxy / intermediate-rank resolution during initialization. On a single-system topology systemId == 0 makes netId == dev, which masks the bug. The sibling call earlier in the function and the entire PXN_LEVEL=2 branch already pass netId.

This also removes a latent NULL dereference: a few lines above, *dev is written under if (dev) *dev = netDev;, but the ncclTopoGetIntermediateRank call dereferenced *dev unconditionally; callers may pass dev == NULL on this path.

Related Issues

None.

Changes & Impact

  • src/graph/search.cc (ncclTopoGetNetDev, PXN level 1): pass netId instead of *dev to ncclTopoGetIntermediateRank, matching the sibling call site and the PXN level 2 branch. Single-system behavior is unchanged (netId == dev there); the fix corrects intermediate-rank resolution on multi-system / MNNVL topologies and removes the unguarded *dev deref. Non-breaking.

Performance Impact

None.

Testing

  • Builds clean with make src.build.
  • Verified by inspection against the two correct call sites (the sibling call and the PXN level 2 branch both pass netId) and the NCCL_TOPO_ID(systemId, dev) definition. The failure requires a multi-system / MNNVL topology with NCCL_P2P_PXN_LEVEL=1, which cannot be reproduced on a single-GPU machine; on single-system topologies systemId == 0, so the change is behavior-preserving there.

…level 1

In the NCCL_P2P_PXN_LEVEL=1 branch of ncclTopoGetNetDev, the third arg
passed was *dev (the raw net device index) where ncclTopoGetIntermediateRank
expects an int64_t netId (used via ncclTopoIdToIndex). netId = (systemId<<56)|dev,
so on a multi-system/MNNVL topology (systemId!=0) the lookup fails with
ncclInternalError; single-system (systemId==0) masks it since netId==dev.
The sibling call (line ~1417) and the PXN level 2 branch already pass netId.
Also removes an unguarded *dev deref (dev may be NULL on this path).

Signed-off-by: EylonKrause <eylon1909@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant