Add NCCL task protocol trace metadata by sanrise · Pull Request #2223 · NVIDIA/nccl

sanrise · 2026-06-08T23:30:05Z

Description

Adds per-task NVTX protocol metadata so the executed NCCL protocol (LL / LL128 / Simple) is visible in profiles. Today the kernel symbol is named after the inlined LL body, not the tuned protocol, so the only way to see the real protocol is NCCL_DEBUG_SUBSYS=TUNING, which is too noisy for normal profiling. This pushes one NVTX range per scheduled collective task, named with its protocol, plus a compact structured payload (comm, rank, bytes, collective, algorithm, protocol, dtype, root, channels) stored as enum integers to keep the launch path cheap. nsys then reports the executed protocol per task, even when protocols are fused into one launch.

torch.profiler / Kineto does not decode external NVTX payloads; those consumers should read protocol from the NCCL profiler plugin (unaffected here).
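As a rough illustration of the payload described above, here is a minimal sketch of a fixed-layout record that stores its fields as integers so it can be copied without serialization. The struct name, field names, field widths, and the helpers `protoLabel` and `makeTaskPayload` are hypothetical and are not taken from the PR. Only the protocol values (00 LL, 01 LL128, 02 Simple) come from the correctness check reported in this description.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Protocol values as reported in the PR's correctness check:
// 00 = LL, 01 = LL128, 02 = Simple.
enum class Proto : uint8_t { LL = 0, LL128 = 1, Simple = 2 };

// Hypothetical per-task payload. Field names, order, and widths are
// assumptions for illustration; the PR's actual schema may differ.
struct TaskPayload {
  uint64_t commId;     // communicator identifier
  uint64_t bytes;      // message size in bytes
  int32_t  rank;
  int32_t  root;
  uint32_t channels;
  uint8_t  collective; // enum stored as an integer
  uint8_t  algorithm;  // enum stored as an integer
  uint8_t  protocol;   // enum stored as an integer
  uint8_t  dtype;      // enum stored as an integer
};

// Integer-only fields keep the record trivially copyable, so filling it
// on the launch path is a handful of stores with no string formatting.
static_assert(std::is_trivially_copyable<TaskPayload>::value,
              "payload must be a plain fixed-layout record");

// Range label derived from the protocol byte (hypothetical helper).
inline std::string protoLabel(uint8_t protocol) {
  switch (static_cast<Proto>(protocol)) {
    case Proto::LL:     return "LL";
    case Proto::LL128:  return "LL128";
    case Proto::Simple: return "Simple";
  }
  return "Unknown";
}

// Builds a payload for one scheduled task (hypothetical helper).
inline TaskPayload makeTaskPayload(uint64_t commId, int32_t rank,
                                   uint64_t bytes, Proto proto) {
  TaskPayload p{};  // zero-initialize every field
  p.commId = commId;
  p.rank = rank;
  p.bytes = bytes;
  p.protocol = static_cast<uint8_t>(proto);
  assert(p.protocol <= static_cast<uint8_t>(Proto::Simple));
  return p;
}
```

For example, `makeTaskPayload(id, 3, 1024, Proto::LL128)` yields a record whose `protocol` byte is 1, and `protoLabel` turns that byte back into the string used for the range name in this sketch.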

Related Issues

Part 1 of 2 for #2196. Part 2 removes the misleading kernel protocol suffix and should merge after this: #2224

Changes & Impact

Adds an internal NVTX schema and a per-task range + structured payload pushed at launch (src/enqueue.cc, src/include/nvtx.h, src/include/nvtx_payload_schemas.h).
No public API changes.
No effect when NVTX is disabled (early return on ncclParamNvtxDisable).

Performance Impact

Binary size: libnccl.so +~11 KB (+0.02%); dynamic symbol count unchanged.
all_reduce_perf 8 B–64 MiB, forced Simple/LL/LL128: flat vs baseline.
Correctness check: forced NCCL_PROTO shows the matching NVTX range, and the payload protocol byte matches the enum (02 Simple, 00 LL, 01 LL128).
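The correctness check above could be automated along these lines. This is a sketch under stated assumptions: it handles only a single forced `NCCL_PROTO` value (not a list), it assumes the range name is exactly the protocol string with the same capitalization, and both function names are hypothetical. The byte values are the ones listed in the check.

```cpp
#include <cstdint>
#include <string>

// Maps a single forced NCCL_PROTO value to the payload byte listed in
// the PR (00 LL, 01 LL128, 02 Simple). Returns -1 for anything else.
// Assumption: one protocol name, compared case-sensitively.
inline int expectedProtoByte(const std::string& forcedProto) {
  if (forcedProto == "LL") return 0x00;
  if (forcedProto == "LL128") return 0x01;
  if (forcedProto == "Simple") return 0x02;
  return -1;
}

// True when both observations agree with the forced protocol:
// the NVTX range name and the protocol byte read from the payload.
// Assumption: the range name equals the protocol string verbatim.
inline bool payloadMatchesForcedProto(const std::string& forcedProto,
                                      const std::string& rangeName,
                                      uint8_t payloadProtoByte) {
  const int expected = expectedProtoByte(forcedProto);
  return expected >= 0 &&
         rangeName == forcedProto &&
         payloadProtoByte == static_cast<uint8_t>(expected);
}
```

A run forced to LL128 would pass with `payloadMatchesForcedProto("LL128", "LL128", 0x01)` and fail if either the range name or the byte disagreed.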

Problem

The executed protocol (LL/LL128/Simple) of a collective is not visible in profiles. The kernel symbol is named after the inlined LL body, not the tuned protocol, and the only way to see the real protocol today is NCCL_DEBUG_SUBSYS=TUNING, which is too noisy for normal profiling.

Solution

Push one NVTX range per scheduled collective task, named with its protocol, plus a compact structured payload (comm, rank, bytes, collective, algorithm, protocol, dtype, root, channels) stored as enum integers to keep the launch path cheap. nsys then reports the executed protocol per task, even when multiple protocols are fused into one launch.

Limitations

This surfaces the protocol in nsys. torch.profiler / Kineto does not decode external NVTX payloads; those consumers should read the protocol from the NCCL profiler plugin instead.

This is part 1 of 2 addressing NVIDIA#2196. Part 2 removes the misleading protocol suffix from kernel symbols and should be merged after this change, so the executed protocol stays observable once the suffix is gone.

Addresses NVIDIA#2196

Signed-off-by: Darshan Sanghani <dsang@meta.com>

sanrise · 2026-06-17T05:58:09Z

@xiaofanl-nvidia any comments/feedback on this?

MoraruMaxim · 2026-06-22T16:14:02Z

I have a few comments. For some of them, the answer can simply be "we are OK with that."

NCCL already has NVTX instrumentation at the API level, so with this PR we would have some duplication (e.g., the communicator id, message size). It might actually make sense to duplicate some of this information for clarity, but we need to decide.
Currently the label contains only the protocol name. We should decide what the label must indicate. We should also reuse the existing ncclProtoToString function rather than duplicating it in ncclNvtxProtoRangeName.
The current PR does not expose the redop in the kernel task payload, even though it is exposed by the existing API-level NVTX instrumentation in NCCL (for AllReduce, Reduce, ReduceScatter).
We should run more experiments to evaluate the profiling overhead, in particular at small message sizes.
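To make the redop comment concrete, here is one way a task payload could carry the reduction operator only for collectives that reduce. It is a sketch, not the PR's code: the `Coll` enum and its values, the `kNoRedOp` sentinel, and both helper names are hypothetical. The only fact taken from the review is that AllReduce, Reduce, and ReduceScatter are the collectives with a redop.

```cpp
#include <cstdint>

// Hypothetical collective identifiers; the values are illustrative and
// are not NCCL's internal numbering.
enum class Coll : uint8_t { Broadcast, Reduce, AllGather, ReduceScatter, AllReduce };

// Hypothetical sentinel stored when a collective has no reduction op.
constexpr uint8_t kNoRedOp = 0xFF;

// Per the review comment, only these three collectives take a redop.
inline bool collHasRedOp(Coll c) {
  return c == Coll::AllReduce || c == Coll::Reduce || c == Coll::ReduceScatter;
}

// Value to store in a payload's redop field: the operator for reducing
// collectives, the sentinel otherwise.
inline uint8_t payloadRedOp(Coll c, uint8_t redOp) {
  return collHasRedOp(c) ? redOp : kNoRedOp;
}
```

Using a sentinel keeps the field a plain integer, in line with the enum-integer payload described in the PR, while letting a consumer tell "no reduction" apart from operator 0.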

xiaofanl-nvidia · 2026-06-23T00:32:16Z

@MoraruMaxim please work with @armratner to provide feedback to the PR author to improve and get to a merge-able state. Then you can mirror it for merging. Thanks!

This was referenced Jun 8, 2026:

Remove misleading NCCL kernel protocol suffix #2224 (Open)

[RFE]: Misleading RING_LL suffix in AllGather/ReduceScatter kernel names #2196 (Open)

xiaofanl-nvidia requested a review from armratner June 23, 2026 00:32

sanrise wants to merge 1 commit into NVIDIA:master from sanrise:nccl-task-proto-metadata

3 participants
