init: OR-accumulate hasMultiRankNvml across all rank pairs by EylonKrause · Pull Request #2257 · NVIDIA/nccl

EylonKrause · 2026-06-26T17:29:06Z

Description

In initTransportsRank, comm->hasMultiRankNvml is assigned with = inside the nested rank-pair loop:

for (int i = 0; i < nranks; i++) { ...
  for (int j = 0; j < i; j++) {
    comm->hasMultiRankNvml = (comm->peerInfo[i].hostHash == comm->peerInfo[j].hostHash) &&
                             (comm->peerInfo[i].nvmlDev == comm->peerInfo[j].nvmlDev);
    ...
  }
}

Each iteration overwrites the field, so after the loops it reflects only the last evaluated pair (i=nranks-1, j=nranks-2), discarding every earlier pair. The field is documented as "if multiple ranks are using the NVML device" — a global OR over all pairs — and it gates NVLS setup: ncclNvlsInit force-disables NVLS (and makes NCCL_NVLS_ENABLE=1 an error) when it is set. With the bug, a configuration where a non-terminal pair of ranks shares host+nvmlDev but the last pair does not is mis-detected as false (NVLS not disabled when it should be), and vice-versa. The pre-refactor isMultiRankGpu check was a monotonic OR; the refactor replaced it with a per-iteration =, so this is a regression of the original semantics.

Related Issues

None.

Changes & Impact

src/init.cc (initTransportsRank): use |= so the flag OR-accumulates across all rank pairs. comm is ncclCalloc-zeroed, so it starts false and the accumulation is well-defined. Restores the documented global-OR semantics. The common case (no ranks sharing a device) is unchanged; only multi-rank-per-NVML-device configurations (the NCCL_MULTI_RANK_GPU_ENABLE case) are affected, and in the intended direction. Non-breaking.

Performance Impact

None.

Testing

Builds clean with make src.build.
The defect and fix are evident from the loop structure (= overwrites each iteration; |= accumulates) and match the pre-refactor isMultiRankGpu OR semantics. Demonstrating the downstream NVLS effect at runtime requires multiple ranks sharing one NVML device, i.e. multi-rank hardware.

hasMultiRankNvml was assigned with = inside the nested rank-pair loop, so after the loop it reflected only the last pair (i=nranks-1, j=nranks-2) instead of whether ANY pair shares host+nvmlDev. The flag gates NVLS setup (ncclNvlsInit force-disables NVLS when set), so a multi-rank-per-NVML-device config could be mis-detected. Restore the OR-reduction semantics of the pre-refactor isMultiRankGpu check with |= (comm is calloc-zeroed). Signed-off-by: EylonKrause <eylon1909@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

init: OR-accumulate hasMultiRankNvml across all rank pairs#2257

init: OR-accumulate hasMultiRankNvml across all rank pairs#2257
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/multirank-nvml-accumulate

EylonKrause commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

EylonKrause commented Jun 26, 2026

Description

Related Issues

Changes & Impact

Performance Impact

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant