Remove misleading NCCL kernel protocol suffix by sanrise · Pull Request #2224 · NVIDIA/nccl

sanrise · 2026-06-08T23:30:47Z

Description

Removes the misleading protocol suffix from generated kernel symbols. Every kernel is a universal dispatcher that runs the tuned protocol at runtime through ncclDevFuncTable; only the LL body is inlined (for binary size), so a collective tuned to Simple or LL128 still shows up under a _LL kernel in nsys / torch.profiler, and trace-based analysis attributes performance to the wrong protocol. This drops the protocol component from generated ncclDevKernel_* symbols so the name carries only collective + algorithm (both accurate). The protocol-specific ncclDevFunc* dispatch entries are left intact, so runtime behavior is unchanged. (Same direction as the maintainer suggestion on #2196.)
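The naming change described above can be sketched as follows. This is an illustrative sketch only, not the code in src/device/generate.py: the helper names (`kernel_name_old`, `kernel_name_new`, `func_name`) and the exact `ncclDevFunc_` name layout are assumptions made for the example. Only the `ncclDevKernel_AllReduce_Sum_f32_RING_LL` style of symbol comes from this PR.

```python
# Hypothetical sketch of the naming change; helper names are illustrative
# and are not taken from src/device/generate.py.

def _join(prefix, parts):
    # Skip empty components (some collectives have no reduction op or type).
    return prefix + "_".join(p for p in parts if p)

def kernel_name_old(coll, redop, ty, algo, proto):
    # Before: the kernel symbol carried a protocol suffix such as _LL.
    return _join("ncclDevKernel_", [coll, redop, ty, algo, proto])

def kernel_name_new(coll, redop, ty, algo):
    # After: the kernel symbol carries only collective + algorithm.
    return _join("ncclDevKernel_", [coll, redop, ty, algo])

def func_name(coll, redop, ty, algo, proto):
    # Dispatch entries stay protocol-specific (assumed name layout).
    return _join("ncclDevFunc_", [coll, redop, ty, algo, proto])

print(kernel_name_old("AllReduce", "Sum", "f32", "RING", "LL"))
print(kernel_name_new("AllReduce", "Sum", "f32", "RING"))
print(func_name("AllReduce", "Sum", "f32", "RING", "LL"))
```

The point of the split is that only the kernel symbol, which profilers display, loses the protocol; the per-protocol dispatch entries keep it.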

Related Issues

Part 2 of 2 for #2196. Depends on / should merge after the task protocol trace metadata PR: #2223

Changes & Impact

src/device/generate.py: generated ncclDevKernel_* symbols no longer include the protocol component.
No runtime or dispatch change; ncclDevFunc* entries are unchanged.
Kernel symbol names change, so any tool that parses the protocol out of the kernel name will stop seeing it (that is the intent; the protocol is exposed via the metadata PR instead).
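For trace tooling that currently extracts the protocol from the kernel name, a parser that tolerates both the old and the new symbol forms might look like the sketch below. The function name `split_kernel_name` and the suffix list (`LL`, `LL128`, `SIMPLE`) are assumptions for illustration; a protocol recovered from an old-style name should still be treated as unreliable, for the reason this PR gives.

```python
import re

# Assumed set of protocol suffixes; LL128 is listed before LL so the longer
# alternative is tried first.
_PROTO_SUFFIX = re.compile(r"_(LL128|LL|SIMPLE)$")

def split_kernel_name(symbol):
    """Return (base_name, protocol_or_None) for a kernel symbol.

    Old-style names yield the suffix they carried; new-style names yield
    None, signalling that the protocol must come from trace metadata.
    """
    match = _PROTO_SUFFIX.search(symbol)
    if match:
        return symbol[:match.start()], match.group(1)
    return symbol, None

print(split_kernel_name("ncclDevKernel_AllReduce_Sum_f32_RING_LL"))
print(split_kernel_name("ncclDevKernel_AllReduce_Sum_f32_RING"))
```

Returning `None` rather than a default protocol keeps downstream analysis from silently attributing work to LL.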

Performance Impact

None expected; this is a name-only generator change.
Generator output verified: host table has protocol-neutral ncclDevKernel_AllReduce_Sum_u8_RING, the device body still uses NCCL_PROTO_LL, and no _RING_LL host-table symbol remains. Standalone master build passes.
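A check along the lines of the verification above could be scripted roughly as follows. This is a sketch under assumptions: `suffixed_kernels` is a hypothetical helper, the suffix list is assumed, and the symbol list is supplied by hand rather than read from real generator output.

```python
# Assumed protocol suffixes that should no longer appear on kernel symbols.
PROTO_SUFFIXES = ("_LL", "_LL128", "_SIMPLE")

def suffixed_kernels(symbols):
    """Return kernel symbols that still end in a protocol suffix.

    Only ncclDevKernel_* names are checked; ncclDevFunc* entries are
    expected to remain protocol-specific and are ignored.
    """
    return [
        s for s in symbols
        if s.startswith("ncclDevKernel_") and s.endswith(PROTO_SUFFIXES)
    ]

# Hand-written sample input, not real generator output.
sample = [
    "ncclDevKernel_AllReduce_Sum_u8_RING",
    "ncclDevFunc_AllReduce_Sum_u8_RING_LL",
]
print(suffixed_kernels(sample))
```

An empty result corresponds to the "no _RING_LL host-table symbol remains" observation.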

Problem

Generated kernel symbols carry a protocol suffix (e.g. ncclDevKernel_AllReduce_Sum_f32_RING_LL), but every kernel is a universal dispatcher that runs the tuned protocol at runtime through ncclDevFuncTable. Only the LL body is inlined (for binary size), so a collective tuned to Simple or LL128 still appears under a _LL kernel in nsys and torch.profiler, and trace-based analysis attributes performance to the wrong protocol.

Solution

Drop the protocol component from generated ncclDevKernel_* symbols so the name carries only collective and algorithm, both of which are accurate. The protocol-specific ncclDevFunc* dispatch entries are left intact, so runtime behavior is unchanged.

This is part 2 of 2 addressing NVIDIA#2196. It should be merged after the task protocol trace metadata change, which makes the executed protocol observable before this removes the in-trace suffix.

Addresses NVIDIA#2196

Signed-off-by: Darshan Sanghani <dsang@meta.com>

xiaofanl-nvidia · 2026-06-23T00:33:02Z

++ @armratner @gcongiu

This was referenced Jun 8, 2026

Add NCCL task protocol trace metadata #2223 (Open)

[RFE]: Misleading RING_LL suffix in AllGather/ReduceScatter kernel names #2196 (Open)

xiaofanl-nvidia requested review from armratner and gcongiu June 23, 2026 00:33

sanrise wants to merge 1 commit into NVIDIA:master from sanrise:nccl-drop-proto-suffix