Add NCCL task protocol trace metadata #2223
Open
sanrise wants to merge 1 commit into
Problem

The executed protocol (LL/LL128/Simple) of a collective is not visible in profiles. The kernel symbol is named after the inlined LL body, not the tuned protocol, and the only way to see the real protocol today is NCCL_DEBUG_SUBSYS=TUNING, which is too noisy for normal profiling.

Solution

Push one NVTX range per scheduled collective task, named with its protocol, plus a compact structured payload (comm, rank, bytes, collective, algorithm, protocol, dtype, root, channels) stored as enum integers to keep the launch path cheap. nsys then reports the executed protocol per task, even when multiple protocols are fused into one launch.

Limitations

This surfaces the protocol in nsys. torch.profiler / Kineto does not decode external NVTX payloads; those consumers should read the protocol from the NCCL profiler plugin instead.

This is part 1 of 2 addressing NVIDIA#2196. Part 2 removes the misleading protocol suffix from kernel symbols and should be merged after this change, so the executed protocol stays observable once the suffix is gone.

Addresses NVIDIA#2196

Signed-off-by: Darshan Sanghani <dsang@meta.com>
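For readers who want a concrete picture of the "compact structured payload stored as enum integers" idea, here is a minimal sketch. It is not the code from this PR: the struct name `TaskTracePayload`, the field widths and ordering, the helper names, and the `"<collective> <protocol>"` range-name format are all assumptions made for illustration. Only the field list and the protocol numbering (00 LL, 01 LL128, 02 Simple) come from the PR text. The NVTX calls themselves are left out so the sketch compiles without the NVTX headers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Protocol values follow the mapping quoted in this PR (00 LL, 01 LL128, 02 Simple).
enum class TraceProto : uint8_t { LL = 0, LL128 = 1, Simple = 2 };

// Hypothetical payload layout. The field list mirrors the PR description;
// the names, widths, and ordering are assumptions, not the PR's schema.
struct TaskTracePayload {
  uint64_t comm;        // communicator identifier (assumed 64-bit)
  uint64_t bytes;       // message size in bytes
  int32_t  rank;
  int32_t  root;
  uint32_t channels;
  uint8_t  collective;  // enum integer, not a string
  uint8_t  algorithm;   // enum integer
  uint8_t  protocol;    // enum integer (see TraceProto)
  uint8_t  dtype;       // enum integer
};

// Plain integers only, so filling the payload is a handful of stores.
static_assert(std::is_trivially_copyable<TaskTracePayload>::value,
              "payload must be trivially copyable");
static_assert(sizeof(TaskTracePayload) == 32, "payload should stay compact");

inline const char* traceProtoName(TraceProto p) {
  switch (p) {
    case TraceProto::LL:     return "LL";
    case TraceProto::LL128:  return "LL128";
    case TraceProto::Simple: return "Simple";
  }
  return "Unknown";
}

// Hypothetical range-name format: "<collective> <protocol>".
inline std::string traceRangeName(const char* collective, TraceProto p) {
  assert(collective != nullptr);
  return std::string(collective) + " " + traceProtoName(p);
}

inline TaskTracePayload makeTaskTracePayload(uint64_t comm, int32_t rank,
                                             uint64_t bytes, int32_t root,
                                             uint32_t channels,
                                             uint8_t collective,
                                             uint8_t algorithm,
                                             TraceProto proto, uint8_t dtype) {
  TaskTracePayload p{};
  p.comm = comm;
  p.bytes = bytes;
  p.rank = rank;
  p.root = root;
  p.channels = channels;
  p.collective = collective;
  p.algorithm = algorithm;
  p.protocol = static_cast<uint8_t>(proto);
  p.dtype = dtype;
  return p;
}
```

In this sketch the string formatting happens only for the range name; the payload stays a fixed-size block of integers, which is the property the PR relies on to keep the launch path cheap.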
This was referenced Jun 8, 2026
Author
@xiaofanl-nvidia, any comments or feedback on this?
I have a few comments. The answer to some of them can be just "we are OK with that".
Collaborator
@MoraruMaxim please work with @armratner to give the PR author feedback so the PR can be improved and brought to a mergeable state. Then you can mirror it for merging. Thanks!
Description
Adds per-task NVTX protocol metadata so the executed NCCL protocol (LL / LL128 / Simple) is visible in profiles. Today the kernel symbol is named after the inlined LL body, not the tuned protocol, so the only way to see the real protocol is NCCL_DEBUG_SUBSYS=TUNING, which is too noisy for normal profiling. This pushes one NVTX range per scheduled collective task, named with its protocol, plus a compact structured payload (comm, rank, bytes, collective, algorithm, protocol, dtype, root, channels) stored as enum integers to keep the launch path cheap. nsys then reports the executed protocol per task, even when protocols are fused into one launch. torch.profiler / Kineto does not decode external NVTX payloads; those consumers should read the protocol from the NCCL profiler plugin (unaffected here).
Related Issues
Part 1 of 2 for #2196. Part 2 removes the misleading kernel protocol suffix and should merge after this: #2224
Changes & Impact
[...] src/enqueue.cc, src/include/nvtx.h, src/include/nvtx_payload_schemas.h).
[...] ncclParamNvtxDisable).
Performance Impact
libnccl.so: +~11 KB (+0.02%); dynamic symbol count unchanged.
all_reduce_perf 8 B–64 MiB, forced Simple/LL/LL128: flat vs baseline.
NCCL_PROTO shows the matching NVTX range, and the payload protocol byte matches the enum (02 Simple, 00 LL, 01 LL128).
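The byte-to-protocol check described above can be sketched as a small decoder. This is only an illustration: the function names are hypothetical and not part of the PR, and the comparison assumes the forced value is spelled exactly "LL", "LL128", or "Simple". The numeric mapping is the one quoted above (00 LL, 01 LL128, 02 Simple).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Decode the payload protocol byte using the mapping quoted in the PR text.
// Hypothetical helper; any other value is reported as "Unknown".
inline const char* protocolFromPayloadByte(uint8_t b) {
  switch (b) {
    case 0x00: return "LL";
    case 0x01: return "LL128";
    case 0x02: return "Simple";
    default:   return "Unknown";
  }
}

// Returns true when the decoded byte names the protocol that was forced
// for the run. Hypothetical helper; the comparison is case-sensitive.
inline bool payloadMatchesForcedProto(uint8_t b, const char* forced) {
  assert(forced != nullptr);
  return std::strcmp(protocolFromPayloadByte(b), forced) == 0;
}
```

A check of this shape could be run over payload bytes exported from a trace, once per forced protocol, to confirm that each range carries the expected value.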