@liqiangxl
Collaborator

No description provided.

@github-actions

github-actions bot commented Feb 6, 2026

Description

  • Enhanced transpose scheduler to dynamically adjust tile size based on memory bandwidth requirements

  • Added bytes-in-flight calculation using Little's law to optimize memory utilization

  • Doubled tile_size2 when too little data is in flight to saturate memory bandwidth

  • Extended test infrastructure to validate transpose performance with 1 and 2 input tensors

Changes walkthrough

Relevant files
Enhancement
transpose.cpp
enhance transpose scheduler with dynamic tile sizing         

csrc/scheduler/transpose.cpp

  • Added logic to calculate bytes in flight per SM based on tensor sizes
    and tile dimensions
  • Implemented check to double tile_size2 when bits_in_flight_per_sm <
    required_bits_per_sm
  • Uses device properties and scheduler_utils::getRequiredBitsInFlight()
    for calculations
  • +23/-0

Tests
test_transpose.py
extend transpose tests for multiple input configurations

benchmarks/python/test_transpose.py

  • Added num_inputs parameter to support testing with 1 or 2 input
    tensors
  • Modified transpose_fusion and transpose_fwd_fn to handle variable
    input counts
  • Updated test cases to parameterize and validate both single and dual
    input scenarios
  • Enhanced benchmark infrastructure for comprehensive transpose
    performance testing
  • +65/-20 

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Performance Optimization Logic

    The new logic doubles tile_size2 when bits_in_flight_per_sm < required_bits_per_sm. This is a reasonable approach, but consider whether a single doubling is sufficient or if iterative doubling might be needed for very large gaps. Also verify that the calculation of total_input_bits_per_elem correctly accounts for all input tensors in complex fusion scenarios.

    // Double tile_size2 if the default configuration doesn't provide enough
    // bytes in flight to saturate memory bandwidth. This is based on Little's
    // law: bytes_in_flight = bandwidth * latency. We estimate the bits in flight
    // per SM as: (sum of input tensor element sizes) * elements_per_tile *
    // blocks_per_sm. If this is less than the required bits in flight (derived
    // from hardware bandwidth and memory latency), we double tile_size2 to
    // increase the data in flight.
    const auto dev_prop = at::cuda::getCurrentDeviceProperties();
    const int64_t max_blocks_per_sm = dev_prop->maxBlocksPerMultiProcessor;
    const int64_t num_elems_per_tile = tparams->tile_size1 * tparams->tile_size2;
    const int64_t required_bits_per_sm =
        scheduler_utils::getRequiredBitsInFlight();
    int64_t total_input_bits_per_elem = 0;
    for (auto tv : ir_utils::filterByType<TensorView>(fusion->inputs())) {
      total_input_bits_per_elem +=
          dataTypeSizeBit(tv->getDataType().value(), index_type);
    }
    const int64_t bits_in_flight_per_sm =
        total_input_bits_per_elem * num_elems_per_tile * max_blocks_per_sm;
    if (bits_in_flight_per_sm < required_bits_per_sm) {
      tparams->tile_size2 *= 2;
    }
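The question of single versus iterative doubling can be explored with a small arithmetic sketch. The Python below mirrors the estimate in the excerpt (sum of input element sizes × elements per tile × blocks per SM) but is not the PR's code. The helper names, the 32×32 starting tile, the 8 resident blocks per SM, and the 400,000-bit requirement are all illustrative assumptions, not values taken from nvFuser or from any device.

```python
def bits_in_flight_per_sm(input_bits_per_elem, tile_size1, tile_size2, blocks_per_sm):
    """Estimate from the excerpt: input bits per element * tile elements * blocks per SM."""
    return sum(input_bits_per_elem) * tile_size1 * tile_size2 * blocks_per_sm


def adjust_tile_size2(input_bits_per_elem, required_bits_per_sm,
                      tile_size1=32, tile_size2=32, blocks_per_sm=8,
                      max_doublings=1):
    """Double tile_size2 while the estimate is below the requirement.

    max_doublings=1 matches the single `if` in the excerpt; larger values
    model the iterative variant. All defaults are hypothetical.
    """
    for _ in range(max_doublings):
        in_flight = bits_in_flight_per_sm(
            input_bits_per_elem, tile_size1, tile_size2, blocks_per_sm)
        if in_flight >= required_bits_per_sm:
            break
        tile_size2 *= 2
    return tile_size2


# Hypothetical stand-in for scheduler_utils::getRequiredBitsInFlight().
REQUIRED_BITS = 400_000

single = adjust_tile_size2([16], REQUIRED_BITS)                      # one 16-bit input
iterative = adjust_tile_size2([16], REQUIRED_BITS, max_doublings=4)  # same input, up to 4 doublings
two_inputs = adjust_tile_size2([16, 16], REQUIRED_BITS)              # two 16-bit inputs
print(single, iterative, two_inputs)
```

Under these assumed numbers, a single doubling leaves the one-input case at 262,144 bits, still short of the requirement, while the iterative variant stops at tile_size2 = 128. The sketch holds the block count fixed; in practice larger tiles use more shared memory, so the number of resident blocks per SM may fall as the tile grows.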
    Test Parameterization

    The test modifications introduce a num_inputs parameter (1 or 2) to validate both single-input and dual-input transpose scenarios. Ensure that the tests exercise both outcomes of the new tile size adjustment (tile_size2 doubled and tile_size2 left unchanged) across the input configurations, and that the baseline comparisons remain valid.

    @pytest.mark.parametrize(
        "num_inputs",
        [1, 2],
        ids=["1_input", "2_inputs"],
    )
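One way to check whether the parameterization reaches both outcomes of the new branch is to tabulate the estimate for each configuration. The sketch below is a rough, hypothetical aid rather than part of the PR: `doubling_fires`, the 32×32 tile, the 8 blocks per SM, the 400,000-bit requirement, and the choice of 16-bit and 32-bit element sizes are all assumptions, and every input is assumed to share one dtype.

```python
from itertools import product


def doubling_fires(num_inputs, bits_per_elem, required_bits_per_sm,
                   tile_size1=32, tile_size2=32, blocks_per_sm=8):
    """Return True when the estimated bits in flight fall below the requirement.

    Hypothetical helper: assumes all inputs have the same element size.
    """
    total_bits_per_elem = num_inputs * bits_per_elem
    in_flight = total_bits_per_elem * tile_size1 * tile_size2 * blocks_per_sm
    return in_flight < required_bits_per_sm


# Hypothetical stand-in for the hardware-derived requirement.
REQUIRED_BITS = 400_000

# num_inputs values come from the parametrize list above; dtype widths are assumed.
coverage = {
    (n, bits): doubling_fires(n, bits, REQUIRED_BITS)
    for n, bits in product([1, 2], [16, 32])
}

for (n, bits), fires in coverage.items():
    print(f"num_inputs={n}, {bits}-bit: doubling {'fires' if fires else 'skipped'}")

# Both outcomes of the branch are reached only if True and False both appear.
both_branches_hit = set(coverage.values()) == {True, False}
```

With these assumed values only the two-input, 32-bit case skips the doubling, so a test matrix limited to 16-bit inputs would never exercise the unchanged-tile path. The real thresholds depend on the device and on what `getRequiredBitsInFlight()` returns.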

    Test failures

    • (Medium, 3) Shape mismatch in thunder.test_update_aliases higher-order inplace alias update (nvFuser CUDA)

      Test Name A100 GB200 H100 Source
      thunder.tests.test_update_aliases.test_higher_order_inplace_alias_update_nvfuser_cuda_thunder.dtypes.float32

    @liqiangxl changed the base branch from main to llu/transpose_bm on February 6, 2026 17:39
    @liqiangxl
    Collaborator Author

    !test
