Remove misleading NCCL kernel protocol suffix#2224
Open
sanrise wants to merge 1 commit into
Open
Conversation
Problem Generated kernel symbols carry a protocol suffix (e.g. ncclDevKernel_AllReduce_Sum_f32_RING_LL), but every kernel is a universal dispatcher that runs the tuned protocol at runtime through ncclDevFuncTable. Only the LL body is inlined (for binary size), so a collective tuned to Simple or LL128 still appears under a _LL kernel in nsys and torch.profiler, and trace-based analysis attributes performance to the wrong protocol. Solution Drop the protocol component from generated ncclDevKernel_* symbols so the name carries only collective and algorithm, both of which are accurate. The protocol-specific ncclDevFunc* dispatch entries are left intact, so runtime behavior is unchanged. This is part 2 of 2 addressing NVIDIA#2196. It should be merged after the task protocol trace metadata change, which makes the executed protocol observable before this removes the in-trace suffix. Addresses NVIDIA#2196 Signed-off-by: Darshan Sanghani <dsang@meta.com>
This was referenced Jun 8, 2026
Collaborator
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Removes the misleading protocol suffix from generated kernel symbols. Every kernel is a universal dispatcher that runs the tuned protocol at runtime through
ncclDevFuncTable; only the LL body is inlined (for binary size), so a collective tuned to Simple or LL128 still shows up under a_LLkernel innsys/torch.profiler, and trace-based analysis attributes performance to the wrong protocol. This drops the protocol component from generatedncclDevKernel_*symbols so the name carries only collective + algorithm (both accurate). The protocol-specificncclDevFunc*dispatch entries are left intact, so runtime behavior is unchanged. (Same direction as the maintainer suggestion on #2196.)Related Issues
Part 2 of 2 for #2196. Depends on / should merge after the task protocol trace metadata PR: #2223
Changes & Impact
src/device/generate.py: generatedncclDevKernel_*symbols no longer include the protocol component.ncclDevFunc*entries are unchanged.Performance Impact
ncclDevKernel_AllReduce_Sum_u8_RING, the device body still usesNCCL_PROTO_LL, and no_RING_LLhost-table symbol remains. Standalone master build passes.