Issue
The transpose scheduler organizes input and output tensors into two groups based on their vectorizable inner dimensions. The tensors in the smaller group (by byte size) have their cached input/output copies placed in shared memory to perform the transpose, which keeps the shared memory footprint as small as possible.
For input tensors cached in shared memory, the transpose happens during the load from shared memory to registers; for output tensors cached in shared memory, it happens when registers write results back to shared memory.
Add new test parameter num_inputs with values [1, 2] to support both single and dual input transpose scenarios
Modify transpose_fusion function to conditionally create one or two input tensors based on num_inputs parameter
Update transpose_fwd_fn to handle both 1-input and 2-input cases in the baseline benchmark function (see the sketch after this list)
Extend both test_transpose_nvf_benchmark and test_transpose_baseline_benchmark to test input cache transpose (smem→regs) and output cache transpose (regs→smem) scenarios
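To make the intended change concrete, here is a minimal sketch of how the baseline transpose_fwd_fn could branch on num_inputs. The argument layout (tensors first, then the two permuted axes, with num_inputs as the last element) and the trailing relu are assumptions inferred from the walkthrough and sequence diagram below, not a copy of the actual file.

```python
import torch


def transpose_fwd_fn(inputs: list):
    # Hypothetical layout: tensors first, then the two permuted axes,
    # with num_inputs as the last element (per the key-changes summary below).
    num_inputs = inputs[-1]
    if num_inputs == 2:
        # Two inputs: the add makes this the output-cached-in-smem pattern
        # when the fused version is scheduled.
        t0, t1, dim0, dim1 = inputs[0], inputs[1], inputs[2], inputs[3]
        out = torch.transpose(t0 + t1, dim0, dim1)
    else:
        # One input: exercises the input-cached-in-smem pattern instead.
        t0, dim0, dim1 = inputs[0], inputs[1], inputs[2]
        out = torch.transpose(t0, dim0, dim1)
    # The trailing relu mirrors the relu shown in the sequence diagram; whether
    # the real baseline applies it is an assumption.
    return torch.relu(out)
```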
Changes walkthrough
Relevant files
Tests: benchmarks/python/test_transpose.py — extend transpose benchmark to cover input cache transpose
Added num_inputs parameter to transpose_fusion function with default value 2
Modified tensor creation logic to conditionally create one or two input tensors
Updated transpose_fwd_fn to handle both single and dual input scenarios
Added new test parameters for 1-input and 2-input test cases
Updated both nvFuser and baseline benchmark functions to support new test scenarios
Added num_inputs parameter to transpose benchmark tests to enable testing both transpose patterns: (1) input cached in smem with transpose during smem→regs load, and (2) output cached in smem with transpose during regs→smem write.
Key changes:
Modified transpose_fusion() to accept num_inputs parameter (1 or 2), conditionally creating second input tensor and add operation only when num_inputs == 2
Updated transpose_fwd_fn() to handle variable input lists, parsing num_inputs from last element and adjusting dim indices accordingly
Added @pytest.mark.parametrize decorator for num_inputs parameter with values [1, 2] to both test functions (a parametrization sketch follows below)
Adjusted input tensor creation and validation logic in both test_transpose_nvf_benchmark() and test_transpose_baseline_benchmark() to match the fusion definition
The implementation correctly doubles test coverage to benchmark both transpose scheduler patterns while maintaining backward compatibility and validation correctness.
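As an illustration of the parametrization described above, the test might look roughly like the reduced sketch below; the shape, device handling, and the simplified signature (the real tests presumably take benchmark/size/dtype fixtures) are placeholders, and transpose_fwd_fn refers to the earlier sketch.

```python
import pytest
import torch


@pytest.mark.parametrize("num_inputs", [1, 2])
def test_transpose_baseline_benchmark(num_inputs):
    # Placeholder shape and device handling; the real benchmark uses its own
    # size/dtype parametrization and runs on CUDA.
    size = (256, 256, 64)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tensors = [torch.randn(size, device=device) for _ in range(num_inputs)]
    # Tensors first, then the permutation axes, then num_inputs last,
    # matching the layout assumed in the transpose_fwd_fn sketch above.
    inputs = [*tensors, 0, 1, num_inputs]
    out = transpose_fwd_fn(inputs)
    assert out.shape == torch.transpose(tensors[0], 0, 1).shape
```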
Confidence Score: 5/5
This PR is safe to merge with minimal risk
The changes are well-structured, maintain backward compatibility, and follow existing patterns in the codebase. The implementation correctly handles both 1-input and 2-input cases with proper conditional logic throughout. The validation and benchmarking logic has been updated consistently across both test functions. No logical errors, syntax issues, or breaking changes were found.
No files require special attention
Important Files Changed
benchmarks/python/test_transpose.py — Added num_inputs parameter (1 or 2) to test both smem→regs and regs→smem transpose patterns; changes are clean and consistent
Sequence Diagram
sequenceDiagram
participant Test as Test Function
participant Fusion as transpose_fusion()
participant Execute as FusionDefinition.execute()
participant Scheduler as Transpose Scheduler
participant GPU as GPU Memory
Test->>Test: Create input tensor(s)
Test->>Fusion: Define fusion with num_inputs param
alt num_inputs == 2
Fusion->>Fusion: Define T0, T1 tensors
Fusion->>Fusion: T4 = add(T0, T1)
Note over Scheduler,GPU: Output transpose<br/>(regs → smem)
else num_inputs == 1
Fusion->>Fusion: Define T0 tensor only
Fusion->>Fusion: T4 = T0
Note over Scheduler,GPU: Input transpose<br/>(smem → regs)
end
Fusion->>Fusion: T5 = permute(T4)
Fusion->>Fusion: T9 = relu(T5)
alt is_copy_transpose
Fusion->>Fusion: T10 = segment_set(T9)
Fusion->>Fusion: add_output(T10)
else view transpose
Fusion->>Fusion: add_output(T9)
end
Test->>Execute: Run with input tensors
Execute->>Scheduler: Select transpose scheduler
Scheduler->>GPU: Execute optimized transpose
GPU-->>Test: Return result
Test->>Test: Benchmark performance
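For orientation, here is a rough fusion-definition sketch following the diagram above. It is written against nvFuser's python frontend, and the shapes, contiguity, dtype, and permutation axes are illustrative assumptions rather than the code in benchmarks/python/test_transpose.py.

```python
from nvfuser import FusionDefinition, DataType


def transpose_fusion(fd: FusionDefinition, num_inputs: int = 2) -> None:
    # Symbolic 3D input; shape, contiguity, and dtype here are illustrative only.
    t0 = fd.define_tensor(
        shape=[-1, -1, -1], contiguity=[True, True, True], dtype=DataType.Float
    )
    if num_inputs == 2:
        t1 = fd.define_tensor(
            shape=[-1, -1, -1], contiguity=[True, True, True], dtype=DataType.Float
        )
        # Pointwise producer; per the PR description this is the case where the
        # output cache lands in shared memory.
        t4 = fd.ops.add(t0, t1)
    else:
        # Single input; the input cache lands in shared memory instead.
        t4 = t0
    t5 = fd.ops.permute(t4, dims=[0, 2, 1])  # the transpose being benchmarked
    t9 = fd.ops.relu(t5)
    # The diagram's is_copy_transpose branch (segment_set before add_output) is
    # omitted here for brevity.
    fd.add_output(t9)
```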
Issue
The transpose scheduler organizes input and output tensors into two groups based on their vectorizable inner dimensions. The tensors in the smaller group (by byte size) have their cached input/output copies placed in shared memory to perform the transpose, which keeps the shared memory footprint as small as possible.
For input tensors cached in shared memory, the data transpose occurs during the load from shared memory to registers:
Input → smem → (transpose) registers → computations → output
For output tensors cached in shared memory, the data transpose occurs when registers write the results to shared memory:
Input → registers → computations → regs → (transpose) smem → output
The current benchmark only covers the second case: it has two inputs and one output, and the output cache is stored in smem and transposed during the smem write.
Fix:
This PR adds an additional test with 1 input and 1 output; in this case, the input cache is stored in smem and transposed during the smem read.
The current performance of these two types of transpose is reported in this doc.