increase transpose tile size to meet required bytes in flight #5928

liqiangxl · 2026-02-06T15:49:17Z

No description provided.

github-actions · 2026-02-06T15:52:13Z · edited by xwang233

Description

Enhanced transpose scheduler to dynamically adjust tile size based on memory bandwidth requirements
Added bytes-in-flight calculation using Little's law to optimize memory utilization
Doubled tile_size2 when the default tile leaves too little data in flight to saturate memory bandwidth
Extended test infrastructure to validate transpose performance with 1 and 2 input tensors

Changes walkthrough

Relevant files

Enhancement

transpose.cpp (csrc/scheduler/transpose.cpp, +23/-0): `enhance transpose scheduler with dynamic tile sizing`
Added logic to calculate bytes in flight per SM based on tensor sizes and tile dimensions
Implemented check to double tile_size2 when bits_in_flight_per_sm < required_bits_per_sm
Uses device properties and scheduler_utils::getRequiredBitsInFlight() for calculations

Tests

test_transpose.py (benchmarks/python/test_transpose.py, +65/-20): `extend transpose tests for multiple input configurations`
Added num_inputs parameter to support testing with 1 or 2 input tensors
Modified transpose_fusion and transpose_fwd_fn to handle variable input counts
Updated test cases to parameterize and validate both single and dual input scenarios
Enhanced benchmark infrastructure for comprehensive transpose performance testing

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Performance Optimization Logic

The new logic doubles tile_size2 when bits_in_flight_per_sm < required_bits_per_sm. This is a reasonable approach, but consider whether a single doubling is sufficient or if iterative doubling might be needed for very large gaps. Also verify that the calculation of total_input_bits_per_elem correctly accounts for all input tensors in complex fusion scenarios.

// Double tile_size2 if the default configuration doesn't provide enough
// bytes in flight to saturate memory bandwidth. This is based on Little's
// law: bytes_in_flight = bandwidth * latency. We estimate the bits in flight
// per SM as: (sum of input tensor element sizes) * elements_per_tile *
// blocks_per_sm. If this is less than the required bits in flight (derived
// from hardware bandwidth and memory latency), we double tile_size2 to
// increase the data in flight.
const auto dev_prop = at::cuda::getCurrentDeviceProperties();
const int64_t max_blocks_per_sm = dev_prop->maxBlocksPerMultiProcessor;
const int64_t num_elems_per_tile = tparams->tile_size1 * tparams->tile_size2;
const int64_t required_bits_per_sm =
    scheduler_utils::getRequiredBitsInFlight();
int64_t total_input_bits_per_elem = 0;
for (auto tv : ir_utils::filterByType<TensorView>(fusion->inputs())) {
  total_input_bits_per_elem +=
      dataTypeSizeBit(tv->getDataType().value(), index_type);
}
const int64_t bits_in_flight_per_sm =
    total_input_bits_per_elem * num_elems_per_tile * max_blocks_per_sm;
if (bits_in_flight_per_sm < required_bits_per_sm) {
  tparams->tile_size2 *= 2;
}
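
To make the arithmetic in the snippet above concrete, here is a minimal Python sketch of the same check. It is not the scheduler's code: the bandwidth, latency, SM count, blocks-per-SM value, and the 32x32 starting tile are illustrative assumptions, and `required_bits_in_flight_per_sm` is a hypothetical stand-in for `scheduler_utils::getRequiredBitsInFlight()`, whose actual implementation is not shown in this PR. The `adjust_tile_size2_iterative` variant and its cap are also hypothetical; they only illustrate the reviewer's question about repeated doubling.

```python
def required_bits_in_flight_per_sm(bandwidth_bytes_per_s, latency_ns, num_sms):
    # Little's law: data in flight = bandwidth * latency.
    # Integer math: bytes/s * ns * 8 bits/byte, divided by 1e9 ns/s and the SM count.
    return bandwidth_bytes_per_s * latency_ns * 8 // (1_000_000_000 * num_sms)


def bits_in_flight_per_sm(input_bits_per_elem, tile_size1, tile_size2, max_blocks_per_sm):
    # Same estimate as the PR: (sum of input element sizes) * elements per tile * blocks per SM.
    return sum(input_bits_per_elem) * tile_size1 * tile_size2 * max_blocks_per_sm


def adjust_tile_size2(input_bits_per_elem, tile_size1, tile_size2, max_blocks_per_sm, required_bits):
    # Single doubling, as in the PR snippet.
    in_flight = bits_in_flight_per_sm(input_bits_per_elem, tile_size1, tile_size2, max_blocks_per_sm)
    if in_flight < required_bits:
        tile_size2 *= 2
    return tile_size2


def adjust_tile_size2_iterative(input_bits_per_elem, tile_size1, tile_size2,
                                max_blocks_per_sm, required_bits, max_tile_size2=128):
    # Hypothetical variant: keep doubling until the requirement is met
    # or the assumed cap (max_tile_size2) would be exceeded.
    while (bits_in_flight_per_sm(input_bits_per_elem, tile_size1, tile_size2, max_blocks_per_sm)
           < required_bits and tile_size2 * 2 <= max_tile_size2):
        tile_size2 *= 2
    return tile_size2


# Assumed hardware numbers for illustration only: 2 TB/s, 1000 ns latency, 100 SMs.
REQUIRED = required_bits_in_flight_per_sm(2_000_000_000_000, 1000, 100)

# One 16-bit input: 16 * 32 * 32 * 8 = 131072 bits, below REQUIRED, so tile_size2 doubles.
one_fp16 = adjust_tile_size2([16], 32, 32, 8, REQUIRED)
# Two 32-bit inputs: 64 * 32 * 32 * 8 = 524288 bits, enough, so tile_size2 is unchanged.
two_fp32 = adjust_tile_size2([32, 32], 32, 32, 8, REQUIRED)

print(REQUIRED, one_fp16, two_fp32)  # prints: 160000 64 32
```

Under these assumed numbers, a single 8-bit input would still fall short after one doubling (131072 bits at tile_size2 = 64), which is the kind of gap the iterative variant is meant to explore.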

Test Parameterization

The test modifications introduce num_inputs parameter (1 or 2) to validate both single and multi-input transpose scenarios. Ensure that the test coverage adequately exercises the new tile size adjustment logic across different input configurations and that the baseline comparisons remain valid.

@pytest.mark.parametrize(
    "num_inputs",
    [1, 2],
    ids=["1_input", "2_inputs"],
)
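
For checking that baseline comparisons stay valid in both parameterizations, a small pure-Python reference can help. The sketch below is an assumption-heavy illustration: the PR excerpt does not show how `transpose_fusion` combines a second input, so `transpose_reference` (a hypothetical helper) assumes the inputs are added elementwise before the transpose.

```python
def transpose_reference(inputs):
    """Reference result for 1 or 2 same-shaped 2-D inputs given as nested lists."""
    if len(inputs) not in (1, 2):
        raise ValueError("expected 1 or 2 inputs")
    rows, cols = len(inputs[0]), len(inputs[0][0])
    # Assumption: multiple inputs are combined by elementwise addition.
    summed = [[sum(t[r][c] for t in inputs) for c in range(cols)] for r in range(rows)]
    # Transpose: output[c][r] = summed[r][c].
    return [[summed[r][c] for r in range(rows)] for c in range(cols)]


a = [[1, 2, 3], [4, 5, 6]]
b = [[10, 20, 30], [40, 50, 60]]

# Mirror the two parameterized cases: num_inputs = 1 and num_inputs = 2.
for num_inputs in (1, 2):
    print(num_inputs, transpose_reference([a, b][:num_inputs]))
```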

Test failures

(Medium, 3) Shape mismatch in thunder.test_update_aliases higher-order inplace alias update (nvFuser CUDA)

Test Name A100 GB200 H100 Source

thunder.tests.test_update_aliases.test_higher_order_inplace_alias_update_nvfuser_cuda_thunder.dtypes.float32 ❌ ❌ ❌

liqiangxl · 2026-02-06T17:40:11Z

!test

increase transpose tile size to meet required bytes in flight

127dc46

liqiangxl changed the base branch from main to llu/transpose_bm February 6, 2026 17:39
