Issue
The transpose scheduler organizes input and output tensors into two groups based on their vectorizable inner dimensions. The tensors in the smaller group (by byte size) have their cached input/output copies placed in shared memory to perform the transpose, which keeps the shared memory footprint as small as possible.
For input tensors cached in shared memory, the transpose happens during the load from shared memory to registers; for output tensors cached in shared memory, it happens when registers write results back to shared memory.
Add new test parameter num_inputs with values [1, 2] to support both single and dual input transpose scenarios
Modify transpose_fusion function to conditionally create one or two input tensors based on num_inputs parameter
Update transpose_fwd_fn to handle both 1-input and 2-input cases in the baseline benchmark function (see the sketch after this list)
Extend both test_transpose_nvf_benchmark and test_transpose_baseline_benchmark to test input cache transpose (smem→regs) and output cache transpose (regs→smem) scenarios
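To make the intended change concrete, here is a minimal sketch of how the baseline transpose_fwd_fn could branch on num_inputs. The argument layout (tensors first, then the two permuted axes, with num_inputs as the last element) and the trailing relu are assumptions inferred from the walkthrough and sequence diagram below, not a copy of the actual file.

```python
import torch


def transpose_fwd_fn(inputs: list):
    # Hypothetical layout: tensors first, then the two permuted axes,
    # with num_inputs as the last element (per the key-changes summary below).
    num_inputs = inputs[-1]
    if num_inputs == 2:
        # Two inputs: the add makes this the output-cached-in-smem pattern
        # when the fused version is scheduled.
        t0, t1, dim0, dim1 = inputs[0], inputs[1], inputs[2], inputs[3]
        out = torch.transpose(t0 + t1, dim0, dim1)
    else:
        # One input: exercises the input-cached-in-smem pattern instead.
        t0, dim0, dim1 = inputs[0], inputs[1], inputs[2]
        out = torch.transpose(t0, dim0, dim1)
    # The trailing relu mirrors the relu shown in the sequence diagram; whether
    # the real baseline applies it is an assumption.
    return torch.relu(out)
```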
Changes walkthrough
Relevant files
Tests: benchmarks/python/test_transpose.py — extend transpose benchmark to cover input cache transpose
Added num_inputs parameter to transpose_fusion function with default value 2
Modified tensor creation logic to conditionally create one or two input tensors
Updated transpose_fwd_fn to handle both single and dual input scenarios
Added new test parameters for 1-input and 2-input test cases
Updated both nvFuser and baseline benchmark functions to support new test scenarios
Added num_inputs parameter to transpose benchmark tests to enable testing both transpose patterns: (1) input cached in smem with transpose during smem→regs load, and (2) output cached in smem with transpose during regs→smem write.
Key changes:
Modified transpose_fusion() to accept num_inputs parameter (1 or 2), conditionally creating second input tensor and add operation only when num_inputs == 2
Updated transpose_fwd_fn() to handle variable input lists, parsing num_inputs from last element and adjusting dim indices accordingly
Added @pytest.mark.parametrize decorator for num_inputs parameter with values [1, 2] to both test functions (a parametrization sketch follows below)
Adjusted input tensor creation and validation logic in both test_transpose_nvf_benchmark() and test_transpose_baseline_benchmark() to match the fusion definition
The implementation correctly doubles test coverage to benchmark both transpose scheduler patterns while maintaining backward compatibility and validation correctness.
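As an illustration of the parametrization described above, the test might look roughly like the reduced sketch below; the shape, device handling, and the simplified signature (the real tests presumably take benchmark/size/dtype fixtures) are placeholders, and transpose_fwd_fn refers to the earlier sketch.

```python
import pytest
import torch


@pytest.mark.parametrize("num_inputs", [1, 2])
def test_transpose_baseline_benchmark(num_inputs):
    # Placeholder shape and device handling; the real benchmark uses its own
    # size/dtype parametrization and runs on CUDA.
    size = (256, 256, 64)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tensors = [torch.randn(size, device=device) for _ in range(num_inputs)]
    # Tensors first, then the permutation axes, then num_inputs last,
    # matching the layout assumed in the transpose_fwd_fn sketch above.
    inputs = [*tensors, 0, 1, num_inputs]
    out = transpose_fwd_fn(inputs)
    assert out.shape == torch.transpose(tensors[0], 0, 1).shape
```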
Confidence Score: 5/5
This PR is safe to merge with minimal risk
The changes are well-structured, maintain backward compatibility, and follow existing patterns in the codebase. The implementation correctly handles both 1-input and 2-input cases with proper conditional logic throughout. The validation and benchmarking logic has been updated consistently across both test functions. No logical errors, syntax issues, or breaking changes were found.
No files require special attention
Important Files Changed
benchmarks/python/test_transpose.py — Added num_inputs parameter (1 or 2) to test both smem→regs and regs→smem transpose patterns; changes are clean and consistent
Sequence Diagram
sequenceDiagram
participant Test as Test Function
participant Fusion as transpose_fusion()
participant Execute as FusionDefinition.execute()
participant Scheduler as Transpose Scheduler
participant GPU as GPU Memory
Test->>Test: Create input tensor(s)
Test->>Fusion: Define fusion with num_inputs param
alt num_inputs == 2
Fusion->>Fusion: Define T0, T1 tensors
Fusion->>Fusion: T4 = add(T0, T1)
Note over Scheduler,GPU: Output transpose<br/>(regs → smem)
else num_inputs == 1
Fusion->>Fusion: Define T0 tensor only
Fusion->>Fusion: T4 = T0
Note over Scheduler,GPU: Input transpose<br/>(smem → regs)
end
Fusion->>Fusion: T5 = permute(T4)
Fusion->>Fusion: T9 = relu(T5)
alt is_copy_transpose
Fusion->>Fusion: T10 = segment_set(T9)
Fusion->>Fusion: add_output(T10)
else view transpose
Fusion->>Fusion: add_output(T9)
end
Test->>Execute: Run with input tensors
Execute->>Scheduler: Select transpose scheduler
Scheduler->>GPU: Execute optimized transpose
GPU-->>Test: Return result
Test->>Test: Benchmark performance
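For orientation, here is a rough fusion-definition sketch following the diagram above. It is written against nvFuser's python frontend, and the shapes, contiguity, dtype, and permutation axes are illustrative assumptions rather than the code in benchmarks/python/test_transpose.py.

```python
from nvfuser import FusionDefinition, DataType


def transpose_fusion(fd: FusionDefinition, num_inputs: int = 2) -> None:
    # Symbolic 3D input; shape, contiguity, and dtype here are illustrative only.
    t0 = fd.define_tensor(
        shape=[-1, -1, -1], contiguity=[True, True, True], dtype=DataType.Float
    )
    if num_inputs == 2:
        t1 = fd.define_tensor(
            shape=[-1, -1, -1], contiguity=[True, True, True], dtype=DataType.Float
        )
        # Pointwise producer; per the PR description this is the case where the
        # output cache lands in shared memory.
        t4 = fd.ops.add(t0, t1)
    else:
        # Single input; the input cache lands in shared memory instead.
        t4 = t0
    t5 = fd.ops.permute(t4, dims=[0, 2, 1])  # the transpose being benchmarked
    t9 = fd.ops.relu(t5)
    # The diagram's is_copy_transpose branch (segment_set before add_output) is
    # omitted here for brevity.
    fd.add_output(t9)
```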
Issue
The transpose scheduler organizes input and output tensors into two groups based on their vectorizable inner dimensions. The tensors in the smaller group (by byte size) have their cached input/output copies placed in shared memory to perform the transpose, which keeps the shared memory footprint as small as possible.
For input tensors cached in shared memory, the data transpose occurs during the load from shared memory to registers:
Input → smem → (transpose) registers → computations → output
For output tensors cached in shared memory, the data transpose occurs when registers write the results to shared memory:
Input → registers → computations → regs → (transpose) smem → output
The current benchmark only covers the second case: it has two inputs and one output, and the output cache is stored in smem and transposed during the smem write.
Fix:
This PR adds an additional test with 1 input and 1 output; in this case, the input cache is stored in smem and transposed during the smem read.
The current performance of these two types of transpose is reported in this doc.