Fix/tree perf/subtree traversal optimization #589
Conversation
… gpu_events lookup
Overall the core optimization is correct and well-motivated — the recursive _compute_subtree_kernel_time_us traversal was redundant since gpu_events is already propagated by add_gpu_ops_to_tree. CI passes and the logic matches patterns used throughout the rest of the file. Nice work.
However, I'd like to avoid changing the 53 reference CSVs. The sorting additions for determinism are good, but we can make the test comparison order-insensitive instead of regenerating all the golden files. I verified this locally — all 13 regression tests pass with main's reference CSVs after this one change.
Requested changes
1. Add order-insensitive comparison for kernel detail columns in tests/conftest.py
In `normalize_value`, change:

```python
elif isinstance(val, list):
    return [normalize_value(v) for v in val]
```

to:

```python
elif isinstance(val, list):
    normalized = [normalize_value(v) for v in val]
    if normalized and all(isinstance(v, dict) for v in normalized):
        normalized.sort(key=lambda d: str(sorted(d.items())))
    return normalized
```

This canonically sorts lists of dicts (`kernel_details`, `trunc_kernel_details`, etc.) before comparison, making the tests insensitive to kernel ordering within those columns. It only affects lists of dicts — lists of strings/numbers/lists are left alone.
2. Revert all reference CSV changes back to main

```
git checkout origin/main -- tests/traces/
```

With change #1, the tests pass without any reference file modifications.
…n, making kernel ordering irrelevant, reference csv files reverted to main
Hi @ajassani, thank you for your review and comments. I have made the requested changes and reverted the reference CSVs back to main.
In `tree_perf.py`, `get_kernel_launchers` computed subtree GPU time by calling `_compute_subtree_kernel_time_us(event)`, which called `loop_and_aggregate_kernels` (a full recursive subtree traversal) for every launcher. Since `add_gpu_ops_to_tree` already propagates all GPU kernel UIDs up to every ancestor via `event["gpu_events"]`, this is now a direct `gpu_events` lookup.

Split PR at request of @ajassani #577 (comment)
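A minimal sketch of the before/after described above — the field names other than `gpu_events` (`kernel_uids`, `children`, `dur_us`) are illustrative stand-ins, not the actual `tree_perf.py` schema:

```python
def subtree_kernel_time_slow(event, kernels_by_uid):
    # Old approach: recursively walk the whole subtree for every launcher.
    total = sum(kernels_by_uid[uid]["dur_us"] for uid in event.get("kernel_uids", []))
    for child in event.get("children", []):
        total += subtree_kernel_time_slow(child, kernels_by_uid)
    return total

def subtree_kernel_time_fast(event, kernels_by_uid):
    # New approach: gpu_events was already propagated to every ancestor,
    # so the subtree's kernel UIDs can be read off the event directly.
    return sum(kernels_by_uid[uid]["dur_us"] for uid in event.get("gpu_events", []))

# Tiny two-node tree: the root launches kernel 1, its child launches kernel 2,
# and gpu_events on the root already includes both.
kernels = {1: {"dur_us": 5.0}, 2: {"dur_us": 2.0}}
tree = {
    "kernel_uids": [1],
    "gpu_events": [1, 2],
    "children": [{"kernel_uids": [2], "gpu_events": [2], "children": []}],
}
assert subtree_kernel_time_slow(tree, kernels) == subtree_kernel_time_fast(tree, kernels) == 7.0
```

Both functions agree on the total; the fast path simply avoids re-traversing children that `add_gpu_ops_to_tree` has already summarized.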
Pull Request Template