add transpose benchmark for regs -> smem transpose#5927

Open
liqiangxl wants to merge 2 commits into main from llu/transpose_bm
@liqiangxl
Collaborator

Issue
The transpose scheduler organizes input and output tensors into two groups based on their vectorizable inner dimensions. Tensors in the second group are smaller in byte size, and their cached input/output versions are stored in shared memory, where the transpose is performed. Caching the smaller group keeps shared memory usage to a minimum.
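
As a rough illustration of the grouping rule described above, here is a minimal sketch in plain Python. It is not nvFuser code: the function name `pick_smem_group`, the `(name, inner_dim, num_bytes)` tuple layout, and the byte counts are all hypothetical. It only shows the idea of splitting tensors by inner dimension and caching the group with the smaller total size.

```python
from collections import defaultdict


def pick_smem_group(tensors):
    """Pick the group of tensors that would be cached in shared memory.

    `tensors` is a list of (name, inner_dim, num_bytes) tuples. This layout
    is a hypothetical stand-in for the scheduler's real data structures.
    """
    groups = defaultdict(list)
    for name, inner_dim, num_bytes in tensors:
        groups[inner_dim].append((name, num_bytes))

    if len(groups) != 2:
        raise ValueError("expected exactly two distinct inner dimensions")

    # Total byte size of each group.
    totals = {dim: sum(b for _, b in members) for dim, members in groups.items()}

    # The smaller group goes to shared memory. Ties fall back to insertion
    # order, which is an arbitrary choice made for this sketch only.
    smem_dim = min(totals, key=totals.get)
    names = [name for name, _ in groups[smem_dim]]
    return smem_dim, names, totals[smem_dim]


# Two inputs share inner dimension "j"; the single output has inner dimension "i".
dim, names, total = pick_smem_group(
    [("T0", "j", 4096), ("T1", "j", 4096), ("T_out", "i", 4096)]
)
print(dim, names, total)
```

With two inputs and one output of equal size, the output group is the smaller one, which matches the two-input benchmark case described below.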

For input tensors cached in shared memory, the data transpose occurs during the load from shared memory to registers:

Input → smem → (transpose) registers → computations → output

For output tensors cached in shared memory, the data transpose occurs when registers write the results to shared memory:

Input → registers → computations → registers → (transpose) smem → output
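
The two data paths can be mimicked on the CPU with nested lists. This is only a sketch of where the index swap happens in each path. The names `transpose_on_smem_read` and `transpose_on_smem_write` are made up for illustration, and nothing here models real shared memory, vectorization, or bank conflicts.

```python
def transpose_on_smem_read(tile):
    """Input path: stage the tile in 'smem' unchanged, then read it transposed."""
    rows, cols = len(tile), len(tile[0])
    smem = [row[:] for row in tile]  # global memory -> smem, same layout
    # smem -> registers: output element (c, r) reads smem[r][c].
    return [[smem[r][c] for r in range(rows)] for c in range(cols)]


def transpose_on_smem_write(tile):
    """Output path: keep values in 'registers', write them to 'smem' transposed."""
    rows, cols = len(tile), len(tile[0])
    regs = [row[:] for row in tile]  # results of the computation
    smem = [[None] * rows for _ in range(cols)]
    for r in range(rows):
        for c in range(cols):
            smem[c][r] = regs[r][c]  # registers -> smem with swapped indices
    return smem


tile = [[1, 2, 3], [4, 5, 6]]
print(transpose_on_smem_read(tile))
print(transpose_on_smem_write(tile))
```

Both functions produce the same transposed tile. The difference is only which side of the shared memory buffer carries the swapped indexing, and that difference is what the two benchmark variants exercise.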

The current benchmark covers only the second case: it has two inputs and one output, and the output cache is stored in smem and transposed during the smem write.

Fix:
This PR adds a test with one input and one output. In this case, the input cache is stored in smem and transposed during the smem read.

Current performance of these two types of transpose is in this doc.

@liqiangxl liqiangxl marked this pull request as ready for review February 6, 2026 14:15
@github-actions
github-actions bot commented Feb 6, 2026

Review updated until commit 9478d11

Description

  • Add new test parameter num_inputs with values [1, 2] to support both single and dual input transpose scenarios

  • Modify transpose_fusion function to conditionally create one or two input tensors based on num_inputs parameter

  • Update transpose_fwd_fn to handle both 1-input and 2-input cases in the baseline benchmark function

  • Extend both test_transpose_nvf_benchmark and test_transpose_baseline_benchmark to test input cache transpose (smem→regs) and output cache transpose (regs→smem) scenarios
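
To make the effect of the new parameter concrete, the sketch below builds a test-case grid with `itertools.product`. It stands in for stacked `@pytest.mark.parametrize` decorators and is not the benchmark's actual parametrization. The shapes in `SIZES` are hypothetical, and treating `is_copy_transpose` as a second axis is an assumption. Only the `num_inputs` values `[1, 2]` and the path each value exercises come from this PR.

```python
from itertools import product

SIZES = [(1024, 1024), (4096, 2048)]  # hypothetical tensor shapes
IS_COPY_TRANSPOSE = [True, False]     # assumed to be an existing test axis
NUM_INPUTS = [1, 2]                   # the axis added by this PR


def transpose_path(num_inputs):
    """Name the transpose path each num_inputs value exercises, per the PR description."""
    return {1: "smem -> regs (input cache)", 2: "regs -> smem (output cache)"}[num_inputs]


def benchmark_cases(include_num_inputs=True):
    """Enumerate the cartesian product of the test axes."""
    axes = [SIZES, IS_COPY_TRANSPOSE]
    if include_num_inputs:
        axes.append(NUM_INPUTS)
    return list(product(*axes))


for size, is_copy, num_inputs in benchmark_cases():
    print(size, is_copy, num_inputs, transpose_path(num_inputs))
```

Adding a two-valued axis doubles the number of generated cases, and every pre-existing combination now runs once per transpose path.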

Changes walkthrough

Relevant files
Tests
test_transpose.py
Extend transpose benchmark to cover input cache transpose

benchmarks/python/test_transpose.py

  • Added num_inputs parameter to transpose_fusion function with default
    value 2
  • Modified tensor creation logic to conditionally create one or two
    input tensors
  • Updated transpose_fwd_fn to handle both single and dual input
    scenarios
  • Added new test parameters for 1-input and 2-input test cases
  • Updated both nvFuser and baseline benchmark functions to support new
    test scenarios
  • +65/-20 

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ No major issues detected

    @greptile-apps
    Contributor

    greptile-apps bot commented Feb 6, 2026

    Greptile Overview

    Greptile Summary

    Added num_inputs parameter to transpose benchmark tests to enable testing both transpose patterns: (1) input cached in smem with transpose during smem→regs load, and (2) output cached in smem with transpose during regs→smem write.

    Key changes:

    • Modified transpose_fusion() to accept a num_inputs parameter (1 or 2), creating the second input tensor and the add operation only when num_inputs == 2
    • Updated transpose_fwd_fn() to handle variable-length input lists, parsing num_inputs from the last element and adjusting the dim indices accordingly
    • Added @pytest.mark.parametrize decorator for num_inputs parameter with values [1, 2] to both test functions
    • Adjusted input tensor creation and validation logic in both test_transpose_nvf_benchmark() and test_transpose_baseline_benchmark() to match the fusion definition
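
The variable-length input handling summarized above could look roughly like the following. The list layout `[tensor(s)..., dim0, dim1, num_inputs]` and the helper name `unpack_transpose_inputs` are assumptions made for this sketch. The summary only states that `num_inputs` is read from the last element and that the dim indices shift accordingly.

```python
def unpack_transpose_inputs(inputs):
    """Split a flat input list into (tensors, dim0, dim1).

    Assumed layout (hypothetical): [t0, dim0, dim1, 1] for one input,
    or [t0, t1, dim0, dim1, 2] for two inputs.
    """
    num_inputs = inputs[-1]
    if num_inputs not in (1, 2):
        raise ValueError(f"num_inputs must be 1 or 2, got {num_inputs!r}")
    if len(inputs) != num_inputs + 3:
        raise ValueError("input list length does not match num_inputs")

    tensors = inputs[:num_inputs]
    # The dims sit right after the tensors, so their positions depend on num_inputs.
    dim0 = inputs[num_inputs]
    dim1 = inputs[num_inputs + 1]
    return tensors, dim0, dim1


print(unpack_transpose_inputs(["t0", 0, 1, 1]))
print(unpack_transpose_inputs(["t0", "t1", 1, 2, 2]))
```

Reading the count first keeps the indexing in one place, so the one-input and two-input cases share the rest of the function body.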

    The implementation correctly doubles test coverage to benchmark both transpose scheduler patterns while maintaining backward compatibility and validation correctness.

    Confidence Score: 5/5

    • This PR is safe to merge with minimal risk
    • The changes are well-structured, maintain backward compatibility, and follow existing patterns in the codebase. The implementation correctly handles both 1-input and 2-input cases with proper conditional logic throughout. The validation and benchmarking logic has been updated consistently across both test functions. No logical errors, syntax issues, or breaking changes were found.
    • No files require special attention

    Important Files Changed

    | Filename | Overview |
    | --- | --- |
    | benchmarks/python/test_transpose.py | Added num_inputs parameter (1 or 2) to test both smem→regs and regs→smem transpose patterns; changes are clean and consistent |

    Sequence Diagram

    sequenceDiagram
        participant Test as Test Function
        participant Fusion as transpose_fusion()
        participant Execute as FusionDefinition.execute()
        participant Scheduler as Transpose Scheduler
        participant GPU as GPU Memory
        
        Test->>Test: Create input tensor(s)
        Test->>Fusion: Define fusion with num_inputs param
        
        alt num_inputs == 2
            Fusion->>Fusion: Define T0, T1 tensors
            Fusion->>Fusion: T4 = add(T0, T1)
            Note over Scheduler,GPU: Output transpose<br/>(regs → smem)
        else num_inputs == 1
            Fusion->>Fusion: Define T0 tensor only
            Fusion->>Fusion: T4 = T0
            Note over Scheduler,GPU: Input transpose<br/>(smem → regs)
        end
        
        Fusion->>Fusion: T5 = permute(T4)
        Fusion->>Fusion: T9 = relu(T5)
        
        alt is_copy_transpose
            Fusion->>Fusion: T10 = segment_set(T9)
            Fusion->>Fusion: add_output(T10)
        else view transpose
            Fusion->>Fusion: add_output(T9)
        end
        
        Test->>Execute: Run with input tensors
        Execute->>Scheduler: Select transpose scheduler
        Scheduler->>GPU: Execute optimized transpose
        GPU-->>Test: Return result
        Test->>Test: Benchmark performance
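
For readers who want to check the expected values by hand, the data flow in the diagram (optional add, then permute, then relu) can be written as a small pure-Python reference on 2-D nested lists. This is a sketch only. `transpose_reference` is a hypothetical name, the real benchmark may permute tensors of higher rank, and the `segment_set` step is left out on the assumption that it does not change values.

```python
def transpose_reference(inputs):
    """Reference for the diagram's data flow on 2-D nested lists.

    `inputs` holds one or two equally shaped matrices.
    """
    if len(inputs) not in (1, 2):
        raise ValueError("expected one or two input matrices")

    t4 = inputs[0]
    if len(inputs) == 2:
        # T4 = add(T0, T1), element-wise.
        t4 = [[a + b for a, b in zip(row_a, row_b)]
              for row_a, row_b in zip(inputs[0], inputs[1])]

    # T5 = permute(T4): swap the two dimensions.
    t5 = [list(col) for col in zip(*t4)]

    # T9 = relu(T5): clamp negative values to zero.
    return [[max(v, 0) for v in row] for row in t5]


x = [[1, -2, 3], [-4, 5, -6]]
ones = [[1, 1, 1], [1, 1, 1]]
print(transpose_reference([x]))
print(transpose_reference([x, ones]))
```

A reference like this is useful for validating small cases in either branch of the first `alt` block before comparing timings.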
    
    Contributor

    @greptile-apps greptile-apps bot left a comment

    1 file reviewed, no comments
