Add NCCL task protocol trace metadata #2223

Open
sanrise wants to merge 1 commit into NVIDIA:master from sanrise:nccl-task-proto-metadata

Conversation

sanrise commented Jun 8, 2026

Description

Adds per-task NVTX protocol metadata so the executed NCCL protocol (LL / LL128 / Simple) is visible in profiles. Today the kernel symbol is named after the inlined LL body, not the tuned protocol, so the only way to see the real protocol is NCCL_DEBUG_SUBSYS=TUNING, which is too noisy for normal profiling. This pushes one NVTX range per scheduled collective task, named with its protocol, plus a compact structured payload (comm, rank, bytes, collective, algorithm, protocol, dtype, root, channels) stored as enum integers to keep the launch path cheap. nsys then reports the executed protocol per task, even when protocols are fused into one launch.

torch.profiler / Kineto does not decode external NVTX payloads; those consumers should read the protocol from the NCCL profiler plugin, which this PR does not change.

Related Issues

Part 1 of 2 for #2196. Part 2 removes the misleading kernel protocol suffix and should merge after this: #2224

Changes & Impact

  • Adds an internal NVTX schema and a per-task range + structured payload pushed at launch (src/enqueue.cc, src/include/nvtx.h, src/include/nvtx_payload_schemas.h).
  • No public API changes.
  • No effect when NVTX is disabled (early return on ncclParamNvtxDisable).
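
To make the shape of the change concrete, here is a minimal sketch of a per-task payload made of integers, with one range emitted per task and an early return when tracing is disabled. It is not the code in this PR: `TaskPayload`, `EmittedRange`, `protoName`, and `traceTasks` are hypothetical names, the field widths are assumptions, and a plain vector stands in for the actual NVTX push/pop calls so the example runs without the NVTX headers.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Protocol values as quoted in this PR's correctness check: LL=0, LL128=1, Simple=2.
enum Proto : uint8_t { kProtoLL = 0, kProtoLL128 = 1, kProtoSimple = 2 };

// Hypothetical compact payload. Every field is an integer, so filling it on
// the launch path needs no string formatting. Field widths are assumptions.
struct TaskPayload {
  uint64_t commId;
  uint64_t bytes;
  int32_t rank;
  int32_t root;
  uint8_t collective;
  uint8_t algorithm;
  uint8_t protocol;
  uint8_t dtype;
  uint8_t channels;
};

// Stand-in for an NVTX range: records what would have been emitted.
struct EmittedRange {
  std::string name;
  TaskPayload payload;
};

// Hypothetical helper mapping the protocol enum to a range label.
inline const char* protoName(uint8_t p) {
  switch (p) {
    case kProtoLL: return "LL";
    case kProtoLL128: return "LL128";
    case kProtoSimple: return "Simple";
    default: return "Unknown";
  }
}

// Emits one range per task, even when several tasks share one launch.
// Returns immediately (emitting nothing) when tracing is disabled.
inline std::vector<EmittedRange> traceTasks(const std::vector<TaskPayload>& tasks,
                                            bool nvtxDisabled) {
  std::vector<EmittedRange> out;
  if (nvtxDisabled) return out;
  out.reserve(tasks.size());
  for (const TaskPayload& t : tasks) out.push_back({protoName(t.protocol), t});
  return out;
}
```

In this sketch, two tasks with different protocols passed in one call yield two separately named ranges, which is the property that keeps the protocol visible when tasks are fused into a single launch.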

Performance Impact

  • Binary size: libnccl.so grows by about 11 KB (+0.02%); dynamic symbol count unchanged.
  • all_reduce_perf from 8 B to 64 MiB, with the protocol forced to Simple, LL, and LL128: flat versus baseline.
  • Correctness check: with NCCL_PROTO forced, the matching NVTX range appears, and the payload protocol byte matches the enum (02 Simple, 00 LL, 01 LL128).
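
The correctness check above can be expressed as a small decoder, sketched here for illustration only. `decodeProtoByte` and `payloadMatchesRange` are hypothetical helpers, and the sketch assumes the range name is exactly the protocol name; the byte values are the ones quoted in the bullet above.

```cpp
#include <cstdint>
#include <string>

// Maps the payload protocol byte to a name, using the values quoted in the
// correctness check (00 LL, 01 LL128, 02 Simple).
inline std::string decodeProtoByte(uint8_t b) {
  switch (b) {
    case 0x00: return "LL";
    case 0x01: return "LL128";
    case 0x02: return "Simple";
    default:   return "Unknown";
  }
}

// True when the decoded payload byte agrees with the observed range name.
// Assumes the range label is exactly the protocol name.
inline bool payloadMatchesRange(uint8_t protoByte, const std::string& rangeName) {
  return decodeProtoByte(protoByte) == rangeName;
}
```

A mismatch, such as byte 01 under a range named Simple, would indicate that the label and the payload disagree.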

Problem
The executed protocol (LL/LL128/Simple) of a collective is not visible in
profiles. The kernel symbol is named after the inlined LL body, not the tuned
protocol, and the only way to see the real protocol today is
NCCL_DEBUG_SUBSYS=TUNING, which is too noisy for normal profiling.

Solution
Push one NVTX range per scheduled collective task, named with its protocol, plus
a compact structured payload (comm, rank, bytes, collective, algorithm,
protocol, dtype, root, channels) stored as enum integers to keep the launch path
cheap. nsys then reports the executed protocol per task, even when multiple
protocols are fused into one launch.

Limitations
This surfaces the protocol in nsys. torch.profiler / Kineto does not decode
external NVTX payloads; those consumers should read the protocol from the NCCL
profiler plugin instead.

This is part 1 of 2 addressing NVIDIA#2196. Part 2 removes the misleading protocol
suffix from kernel symbols and should be merged after this change, so the
executed protocol stays observable once the suffix is gone.

Addresses NVIDIA#2196

Signed-off-by: Darshan Sanghani <dsang@meta.com>
sanrise commented Jun 17, 2026

Author

@xiaofanl-nvidia any comments/feedback on this?

@MoraruMaxim


I have a few comments. The answer to some of them can be just "we are ok with that".

  • NCCL already has NVTX instrumentation at the API level, so with this PR we would have some duplication (e.g., the communicator id, message size). It might actually make sense to duplicate some of this information for clarity, but we need to decide.
  • Currently the label only contains the protocol name. We should decide what the label must indicate. Additionally, we should use the existing ncclProtoToString function (instead of adding code duplication with ncclNvtxProtoRangeName).
  • The current PR does not expose the redop in the kernel task payload, even though it is exposed by the existing API-level NVTX instrumentation in NCCL (for AllReduce, Reduce, ReduceScatter).
  • We should run more experiments to evaluate the profiling overhead, in particular at small message sizes.

@xiaofanl-nvidia

Collaborator

@MoraruMaxim please work with @armratner to provide feedback to the PR author to improve and get to a merge-able state. Then you can mirror it for merging. Thanks!

