Remove misleading NCCL kernel protocol suffix #2224

Open
sanrise wants to merge 1 commit into NVIDIA:master from sanrise:nccl-drop-proto-suffix

@sanrise sanrise commented Jun 8, 2026

Description

Removes the misleading protocol suffix from generated kernel symbols. Every kernel is a universal dispatcher that runs the tuned protocol at runtime through ncclDevFuncTable; only the LL body is inlined (for binary size), so a collective tuned to Simple or LL128 still shows up under a _LL kernel in nsys / torch.profiler, and trace-based analysis attributes performance to the wrong protocol. This drops the protocol component from generated ncclDevKernel_* symbols so the name carries only collective + algorithm (both accurate). The protocol-specific ncclDevFunc* dispatch entries are left intact, so runtime behavior is unchanged. (Same direction as the maintainer suggestion on #2196.)
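As a rough illustration of the naming change (a minimal sketch, not the code in src/device/generate.py), the symbol can be thought of as an underscore-joined list of components from which the protocol is now omitted. The helper name `kernel_symbol` and its parameters are hypothetical and exist only for this example.

```python
# Hypothetical helper for illustration; not the actual generator code.
def kernel_symbol(coll, redop, ty, algo, proto=None):
    """Join the non-empty name components with underscores."""
    parts = ["ncclDevKernel", coll, redop, ty, algo, proto]
    return "_".join(p for p in parts if p)

# Before this change: the protocol is part of the kernel symbol.
before = kernel_symbol("AllReduce", "Sum", "f32", "RING", "LL")

# After this change: only collective, redop, type, and algorithm remain.
after = kernel_symbol("AllReduce", "Sum", "f32", "RING")

print(before)  # ncclDevKernel_AllReduce_Sum_f32_RING_LL
print(after)   # ncclDevKernel_AllReduce_Sum_f32_RING
```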

Related Issues

Part 2 of 2 for #2196. Depends on, and should merge after, the task protocol trace metadata PR (#2223).

Changes & Impact

  • src/device/generate.py: generated ncclDevKernel_* symbols no longer include the protocol component.
  • No runtime or dispatch change; ncclDevFunc* entries are unchanged.
  • Kernel symbol names change, so any tool that parses the protocol out of the kernel name will stop seeing it (that is the intent; the protocol is exposed via the metadata PR instead).
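For trace tooling that has to read both older and newer traces, one possible approach is to strip any legacy protocol suffix and treat it as unreliable. The sketch below assumes the protocol names `LL`, `LL128`, and `SIMPLE` as suffixes; the function name `normalize_kernel_name` is hypothetical and not part of NCCL or any profiler.

```python
# Assumed set of legacy protocol suffixes; adjust if your traces differ.
LEGACY_PROTOCOLS = ("LL128", "LL", "SIMPLE")

def normalize_kernel_name(name):
    """Return (protocol_neutral_name, legacy_suffix_or_None).

    The legacy suffix only names the inlined kernel body, so it should not
    be read as the protocol that actually ran.
    """
    if name.startswith("ncclDevKernel_"):
        for proto in LEGACY_PROTOCOLS:
            suffix = "_" + proto
            if name.endswith(suffix):
                return name[: -len(suffix)], proto
    return name, None

old_style = normalize_kernel_name("ncclDevKernel_AllReduce_Sum_f32_RING_LL")
new_style = normalize_kernel_name("ncclDevKernel_AllReduce_Sum_f32_RING")
```

Both calls yield the same protocol-neutral name, so kernels from traces captured before and after this change can be grouped together, with the executed protocol taken from the trace metadata instead.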

Performance Impact

  • None expected; this is a name-only generator change.
  • Generator output verified: the host table contains the protocol-neutral symbol ncclDevKernel_AllReduce_Sum_u8_RING, the device body still uses NCCL_PROTO_LL, and no _RING_LL host-table symbol remains. A standalone build of master passes.
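A check along these lines could be scripted against the generated sources. The following is a minimal sketch, assuming the generated text is available as a string and that protocol suffixes are `LL`, `LL128`, or `SIMPLE`; the function name `find_suffixed_kernels` and the sample strings are hypothetical, not actual generator output.

```python
import re

# Matches ncclDevKernel_* symbols that still end in an assumed protocol suffix.
SUFFIXED_KERNEL = re.compile(r"\bncclDevKernel_\w+_(?:LL128|LL|SIMPLE)\b")

def find_suffixed_kernels(generated_text):
    """Return every kernel symbol that still carries a protocol suffix."""
    return SUFFIXED_KERNEL.findall(generated_text)

# Illustrative input: a neutral kernel symbol plus a protocol-specific
# ncclDevFunc entry, which is expected to remain and must not be flagged.
clean = "ncclDevKernel_AllReduce_Sum_u8_RING\nncclDevFunc_AllReduce_Sum_u8_RING_LL\n"
stale = "ncclDevKernel_AllReduce_Sum_u8_RING_LL\n"

clean_hits = find_suffixed_kernels(clean)
stale_hits = find_suffixed_kernels(stale)
```

Here `clean_hits` is empty, while `stale_hits` contains the old-style symbol, which is the condition the verification above is meant to rule out.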

Problem
Generated kernel symbols carry a protocol suffix (e.g.
ncclDevKernel_AllReduce_Sum_f32_RING_LL), but every kernel is a universal
dispatcher that runs the tuned protocol at runtime through ncclDevFuncTable.
Only the LL body is inlined (for binary size), so a collective tuned to Simple
or LL128 still appears under a _LL kernel in nsys and torch.profiler, and
trace-based analysis attributes performance to the wrong protocol.

Solution
Drop the protocol component from generated ncclDevKernel_* symbols so the name
carries only collective and algorithm, both of which are accurate. The
protocol-specific ncclDevFunc* dispatch entries are left intact, so runtime
behavior is unchanged.

This is part 2 of 2 addressing NVIDIA#2196. It should be merged after the task
protocol trace metadata change, which makes the executed protocol observable
before this removes the in-trace suffix.

Addresses NVIDIA#2196

Signed-off-by: Darshan Sanghani <dsang@meta.com>

@xiaofanl-nvidia (Collaborator)

++ @armratner @gcongiu

