Skip to content

init: OR-accumulate hasMultiRankNvml across all rank pairs#2257

Open
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/multirank-nvml-accumulate
Open

init: OR-accumulate hasMultiRankNvml across all rank pairs#2257
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/multirank-nvml-accumulate

Conversation

@EylonKrause

Copy link
Copy Markdown

Description

In initTransportsRank, comm->hasMultiRankNvml is assigned with = inside the nested rank-pair loop:

for (int i = 0; i < nranks; i++) { ...
  for (int j = 0; j < i; j++) {
    comm->hasMultiRankNvml = (comm->peerInfo[i].hostHash == comm->peerInfo[j].hostHash) &&
                             (comm->peerInfo[i].nvmlDev == comm->peerInfo[j].nvmlDev);
    ...
  }
}

Each iteration overwrites the field, so after the loops it reflects only the last evaluated pair (i=nranks-1, j=nranks-2), discarding every earlier pair. The field is documented as "if multiple ranks are using the NVML device" — a global OR over all pairs — and it gates NVLS setup: ncclNvlsInit force-disables NVLS (and makes NCCL_NVLS_ENABLE=1 an error) when it is set. With the bug, a configuration where a non-terminal pair of ranks shares host+nvmlDev but the last pair does not is mis-detected as false (NVLS not disabled when it should be), and vice-versa. The pre-refactor isMultiRankGpu check was a monotonic OR; the refactor replaced it with a per-iteration =, so this is a regression of the original semantics.

Related Issues

None.

Changes & Impact

  • src/init.cc (initTransportsRank): use |= so the flag OR-accumulates across all rank pairs. comm is ncclCalloc-zeroed, so it starts false and the accumulation is well-defined. Restores the documented global-OR semantics. The common case (no ranks sharing a device) is unchanged; only multi-rank-per-NVML-device configurations (the NCCL_MULTI_RANK_GPU_ENABLE case) are affected, and in the intended direction. Non-breaking.

Performance Impact

None.

Testing

  • Builds clean with make src.build.
  • The defect and fix are evident from the loop structure (= overwrites each iteration; |= accumulates) and match the pre-refactor isMultiRankGpu OR semantics. Demonstrating the downstream NVLS effect at runtime requires multiple ranks sharing one NVML device, i.e. multi-rank hardware.

hasMultiRankNvml was assigned with = inside the nested rank-pair loop, so
after the loop it reflected only the last pair (i=nranks-1, j=nranks-2)
instead of whether ANY pair shares host+nvmlDev. The flag gates NVLS setup
(ncclNvlsInit force-disables NVLS when set), so a multi-rank-per-NVML-device
config could be mis-detected. Restore the OR-reduction semantics of the
pre-refactor isMultiRankGpu check with |= (comm is calloc-zeroed).

Signed-off-by: EylonKrause <eylon1909@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant