Fix/trace2tree: fix cross-rank GPU attribution and merged-trace hang#577
Conversation
Thanks for the detailed write-up and the performance investigation.

On fixes 1 and 2 (cross-rank attribution and the hang): as noted in the TraceFusion docs, merged traces produced by TraceFusion are intended only for visual analysis in Perfetto, not for automated analysis. The tree perf / trace2tree pipeline operates on single-rank traces; the intended workflow for multi-rank analysis is to generate perf reports per rank individually, then analyze or compare the resulting report sheets. Rather than making the trace2tree internals handle merged-trace correlation collisions, I think the better approach would be to detect merged/multi-rank input early and surface a clear error pointing users to the per-rank workflow. That keeps the single-rank code path simple.

On fix 3 (the O(subtree) → O(1) lookup in `tree_perf.py`): this is a clean win, since the propagated `gpu_events` field already carries the needed kernel UIDs.

Suggestion: could we split this into two PRs? Land the fix-3 change on its own, and handle the merged-trace behavior separately.
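The early-detection approach suggested above could look roughly like the sketch below. This is a hypothetical helper, not part of trace2tree; the function name `check_single_rank` and the event categories checked are illustrative, assuming Chrome-trace-style events that carry a `pid` field.

```python
def check_single_rank(trace_events):
    """Hypothetical guard: fail fast if a trace looks merged.

    In a Chrome-trace-style event list, GPU kernel events carry the
    `pid` of the process that launched them. A TraceFusion-merged
    trace contains GPU events from several processes, so seeing more
    than one distinct pid among GPU events is a strong signal that
    the input is a merged multi-rank trace.
    """
    gpu_pids = {e["pid"] for e in trace_events
                if e.get("cat") in ("kernel", "gpu_memcpy", "gpu_memset")
                and "pid" in e}
    if len(gpu_pids) > 1:
        raise ValueError(
            f"Trace contains GPU events from {len(gpu_pids)} processes; "
            "this looks like a merged multi-rank trace. Generate per-rank "
            "perf reports and compare those instead.")
    return True
```

A guard like this would run once at load time, keeping the single-rank code path free of any merged-trace special-casing.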
Hey @ajassani, thank you for your review! Yes, your comments make sense; I can turn this into two PRs so that fix 3 is independent of fixes 1 and 2. Thank you!
# $O(subtree \times N_{launchers})$ traversal in `tree_perf.py`
`get_kernel_launchers` computed subtree GPU time by calling `_compute_subtree_kernel_time_us(event)`, which called `loop_and_aggregate_kernels` (a full recursive subtree traversal) *for every launcher*. Since `add_gpu_ops_to_tree` already propagates all GPU kernel UIDs up to every ancestor via `event["gpu_events"]`, this is now an $O(1)$ field lookup.
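The before/after can be sketched as follows. This is a minimal illustration, not the actual trace2tree code: the event layout (dicts keyed by UID with `children`, `kernels`, and a propagated `gpu_events` list) and both function names are assumptions.

```python
def subtree_kernel_time_us_slow(event, events_by_uid):
    """Old path: a full recursive subtree walk for every launcher,
    O(subtree) per call and O(subtree x N_launchers) overall."""
    total = sum(events_by_uid[k]["dur"] for k in event.get("kernels", []))
    for child_uid in event.get("children", []):
        total += subtree_kernel_time_us_slow(events_by_uid[child_uid],
                                             events_by_uid)
    return total

def subtree_kernel_time_us_fast(event, events_by_uid):
    """New path: add_gpu_ops_to_tree has already propagated every
    kernel UID in the subtree into event["gpu_events"], so reading
    that field replaces the tree walk entirely."""
    return sum(events_by_uid[k]["dur"] for k in event.get("gpu_events", []))
```

Both return the same total; the fast path just reads the precomputed field instead of re-walking the subtree per launcher.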
Split PR at request of @ajassani
#577 (comment)
<!--
Copyright (c) 2024 - 2025 Advanced Micro Devices, Inc. All rights
reserved.
See LICENSE for license information.
-->
# Pull Request Template
> **Note to AMDers:**
> This is a public repository. Please do **not** upload any confidential or customer data. Make sure all such data has been anonymized or removed before making this PR. If you need to attach any private files or links, please insert an internal OneDrive link or a Jira ticket link instead.
## The problem
When PyTorch profiles a single rank, each GPU kernel launch is assigned a correlation ID unique within that session. When traces from K ranks are merged into one file, each rank's correlation IDs restart from the same numeric range, causing collisions.
`add_gpu_ops_to_tree` uses these correlation IDs to link CPU runtime events to their GPU kernels, and was vulnerable to these collisions in two ways. First, `_get_graph_gpu_events` looked up GPU kernels by correlation ID alone, so in a merged trace every graph launch event claimed kernels from all K ranks sharing that ID, producing incorrect attribution. Second, the per-kernel ancestor walk had no mechanism to deduplicate GPU events that appeared as children of multiple runtime parents, causing $O(K^2 \times N_{gpu})$ complexity during propagation.
Together, these were causing issues such as incorrect cross-rank GPU attribution and indefinite hangs when processing merged trace files from PyTorch profiling of multi-node workloads.
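The collision mechanism can be illustrated with a small sketch. This is not the trace2tree code; the field names (`pid`, `correlation`) follow the Chrome trace format, and keying by `(pid, correlation)` is shown only as one way to disambiguate ranks.

```python
from collections import defaultdict

# Two ranks whose profiler sessions both start correlation IDs at 1.
kernels = [
    {"pid": 0, "correlation": 1, "name": "gemm_rank0"},
    {"pid": 1, "correlation": 1, "name": "gemm_rank1"},
]

# Collision-prone index: correlation ID alone. In a merged trace a
# rank-0 launch with correlation 1 now "claims" both ranks' kernels.
by_corr = defaultdict(list)
for k in kernels:
    by_corr[k["correlation"]].append(k)

# Collision-free index: include the process/rank in the key, so each
# launch only ever sees kernels from its own rank.
by_rank_corr = defaultdict(list)
for k in kernels:
    by_rank_corr[(k["pid"], k["correlation"])].append(k)
```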
## This PR

Fixes two bugs in `TraceToTree.add_gpu_ops_to_tree` that caused indefinite hangs and incorrect GPU event attribution when processing large merged multi-rank traces, and removes a related $O(subtree)$ per-launcher traversal in `TreePerfAnalyzer`.

### Three changes

1. **`trace_to_tree.py`**: `_get_graph_gpu_events` looked up GPU kernels by correlation ID alone via `linking_id_to_gpu_events[corr]` (introduced in #522). In merged traces this bucket contains GPU kernels from all K ranks sharing that correlation, so every graph launch was incorrectly claiming foreign-rank kernels as its own, inflating `gpu_events`, `total_subtree_kernel_time`, and `kernel_details` with cross-rank data.

2. **`tree_perf.py`**: `get_kernel_launchers` computed subtree GPU time by calling `_compute_subtree_kernel_time_us(event)`, which called `loop_and_aggregate_kernels` (a full recursive subtree traversal) for every launcher. Since `add_gpu_ops_to_tree` already propagates all GPU kernel UIDs up to every ancestor via `event["gpu_events"]`, this is now an $O(1)$ field lookup.

3. **`trace_to_tree.py`**: In merged K-rank traces, correlation ID collisions caused GPU kernels from all K ranks to become linked as children of every runtime event sharing that correlation. The ancestor walk then ran for all of those cross-rank kernels, producing $O(K \times N_{gpu} \times depth)$ individual `list.append()` calls and an indefinite hang. Replaced the per-kernel ancestor walk with a single BFS topological sort seeded from `cpu_root_nodes`, followed by a reverse-order `list.extend()` propagation pass. A visited set ensures each event is processed exactly once regardless of how many parents claim it, collapsing traversal from $O(K^2 \times N_{gpu})$ to $O(N)$. GC is disabled during propagation to eliminate cyclic collector overhead.

## Testing
**Result:** a large multi-node merged PyTorch trace file that previously ran indefinitely now completes in 235.9 seconds.
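The BFS-plus-reverse-extend propagation described in change 3 can be sketched as below. This is a simplified stand-in for the real `add_gpu_ops_to_tree` logic: the event layout (a dict of UID to event with `children` and `gpu_events` lists) and the function name are assumptions.

```python
from collections import deque
import gc

def propagate_gpu_events(events, cpu_root_uids):
    """Propagate kernel UIDs from leaves up to every ancestor in one pass.

    A BFS from the roots yields a topological order (parents before
    children); walking that order in reverse extends each child's
    gpu_events into its parent after the child's own list is complete.
    The visited set ensures an event claimed by multiple parents is
    enqueued and ordered exactly once.
    """
    order, visited = [], set()
    queue = deque(cpu_root_uids)
    while queue:
        uid = queue.popleft()
        if uid in visited:
            continue
        visited.add(uid)
        order.append(uid)
        queue.extend(events[uid].get("children", []))

    # The per-event lists never form reference cycles, so the cyclic
    # collector only adds overhead during this append-heavy pass.
    gc.disable()
    try:
        for uid in reversed(order):  # children before parents
            ev = events[uid]
            for child_uid in ev.get("children", []):
                ev.setdefault("gpu_events", []).extend(
                    events[child_uid].get("gpu_events", []))
    finally:
        gc.enable()
    return events
```

Because each event is visited once and each `gpu_events` list is extended once per parent edge, the total work is linear in the number of events and edges rather than quadratic in the number of colliding ranks.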