Conversation
Greptile OverviewGreptile SummaryThis PR adds a comprehensive manually-scheduled test for outer reduction operations using 2D TMA (Tensor Memory Accelerator), complementing the existing inner reduction TMA tests. Key changes:
The test follows established patterns from Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Test as Test Framework
participant Fusion as Fusion Builder
participant Schedule as Scheduler
participant TMA as TMA Cache
participant Executor as KernelExecutor
participant Validation as Validator
Test->>Fusion: Create fusion with 2D tensor input [R, I]
Fusion->>Fusion: Add sum reduction on axis 0 (outer reduction)
Fusion->>Fusion: Save fusion_copy for validation
Test->>Schedule: Create TMA cache in shared memory
Schedule->>TMA: cacheAfter with CpAsyncBulkTensorTile
TMA->>TMA: Set memory type to Shared
Test->>Schedule: Apply TMA-level tiling
Schedule->>Schedule: Split reduction dimension by tma_tile_r
Schedule->>Schedule: Split iteration dimension by tma_tile_i
Schedule->>Schedule: Split for grid parallelization (grdim)
Test->>Schedule: Propagate TMA tiling to all tensors
Schedule->>Schedule: MaxLogicalDomainInfoSpanningTree traverse
Test->>Schedule: Parallelize TMA tensor
Schedule->>Schedule: Set BIDy, Serial, Bulk, BIDx parallelization
Schedule->>Schedule: Set allocation domain for shared memory
Test->>Schedule: Sub-split TMA tiles into thread dimensions
Schedule->>Schedule: Split tma_tile_i by bdimx (32)
Schedule->>Schedule: Split tma_tile_r by bdimy (16)
Test->>Schedule: Parallelize reduction tensor
Schedule->>Schedule: Apply rFactor for grid reduction
Schedule->>Schedule: Propagate thread-level splits to non-TMA TVs
Test->>Schedule: Set up iter-grouped reduction
Schedule->>Schedule: propagateParallelization with use_iter_grouped_reduction=true
Schedule->>Schedule: Apply inlineMost optimization
Test->>Executor: Compile fusion with input tensor
Executor->>Executor: Generate CUDA kernel code
Test->>Executor: Run kernel with test input
Executor-->>Test: Return computed outputs
Test->>Validation: testValidate with fusion_copy
Validation->>Validation: Compare scheduled vs unscheduled results
Validation-->>Test: Validation success
|
Description
|
| Relevant files | |||
|---|---|---|---|
| Tests |
|
PR Reviewer Guide
Here are some key observations to aid the review process:
| 🧪 PR contains tests |
| ⚡ Recommended focus areas for review |
Complex Manual Scheduling
|
|
Is the generated kernel same to what we discussed offline? You can attached the generated code & fusion_ir to this PR. |
|
@liqiangxl I added The schedule is very similar. For 16384x16384, both use grid reduction and use To keep things simple, I don't do block-reduce vs grid-reduce comparison. Just have a few options for BIDy count and that's all. |
|
|
||
| // ========== Phase 5: Sub-split TMA tiles into thread dims ========== | ||
| // Split tma_tile_i into [iter_unroll, bdimx] | ||
| redu_tv->split(4, bdimx); |
There was a problem hiding this comment.
Iter domain is the inner most domain, we should spilt it by vectorization factor first, then bdimx, this allows vectorized write of reduction result to gmem.
| tv0smem->axis(4)->parallelize(ParallelType::Bulk); // iteration tile | ||
|
|
||
| // Set allocation domain for proper shared memory layout | ||
| tv0smem->setAllocationDomain(tv0smem->getLoopDomain(), true); |
There was a problem hiding this comment.
Do we need this allocation domain set?
| reduction_tvs, | ||
| /*unroll_vectorizable_cached_tvs=*/{}, | ||
| /*selected_tvs=*/non_tma_tvs); | ||
|
|
There was a problem hiding this comment.
Need vectorize output tensor
Thanks. Can you add block reduction kernel and ir? unswitch is a performance optimization, we can check that later. |
Add manually scheduled test for outer-reduction with 2D TMA.
Dumps for
outer_16384_iter_16384:TMA cuda_kernel
non-TMA cuda_kernel
TMA fusion_ir
Inputs: T0_g_float[iS15{8}, iS16{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}, iS14{128}, iS17{( ceilDiv(i2, 128) )}, iS18{128}] Outputs: T1_g_float[iblockIdx.x25{( ceilDiv(i2, 128) )}, iS50{4}, ithreadIdx.x51{32}] ca_pos( 3 ) %kernel { T2_s_float[iblockIdx.y11{8}, iS12{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}, iB8{128}, iblockIdx.x9{( ceilDiv(i2, 128) )}, iB10{128}] ca_pos( 2 ) = CpAsyncBulkTensorTile( T0_g_float[iS15{8}, iS16{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}, iS14{128}, iS17{( ceilDiv(i2, 128) )}, iS18{128}] ) T4_l_float[iblockIdx.y35{8}rf, rS36{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}rf, rS37{8}rf, ithreadIdx.y38{16}rf, iblockIdx.x39{( ceilDiv(i2, 128) )}, iS41{4}, ithreadIdx.x42{32}] ca_pos( 1 ) produce_pos( 2 ) = reduction( T2_s_float[iblockIdx.y11{8}, iS12{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}, iB8{128}, iblockIdx.x9{( ceilDiv(i2, 128) )}, iB10{128}] ca_pos( 2 ), op = add, initial value = float(0), allreduce = false ) T3_l_float[rblockIdx.y43{8}, rthreadIdx.y44{16}, iblockIdx.x46{( ceilDiv(i2, 128) )}, iG48{4}, ithreadIdx.x49{32}] produce_pos( 1 ) = reduction( T4_l_float[iblockIdx.y35{8}rf, rS36{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}rf, rS37{8}rf, ithreadIdx.y38{16}rf, iblockIdx.x39{( ceilDiv(i2, 128) )}, iS41{4}, ithreadIdx.x42{32}] ca_pos( 1 ) produce_pos( 2 ), op = add, initial value = float(0), allreduce = false ) T1_g_float[iblockIdx.x25{( ceilDiv(i2, 128) )}, iS50{4}, ithreadIdx.x51{32}] ca_pos( 3 ) = Set( T3_l_float[rblockIdx.y43{8}, rthreadIdx.y44{16}, iblockIdx.x46{( ceilDiv(i2, 128) )}, iG48{4}, ithreadIdx.x49{32}] produce_pos( 1 ), cache_op=Streaming ) TransformPrinter : T0_g_float[iS15{8}, iS16{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}, iS14{128}, iS17{( ceilDiv(i2, 128) )}, iS18{128}] logical domain: (iS0{i0}, iS1{i2}) contiguity: t t Split: iS0{i0} by factor 128 -> iS13{( ceilDiv(i0, 128) )}, iS14{128} Outer split: iS13{( ceilDiv(i0, 128) )} by factor 8 -> iS15{8}, iS16{( ceilDiv(( ceilDiv(i0, 128) ), 8) )} Split: iS1{i2} by factor 128 -> iS17{( ceilDiv(i2, 128) )}, iS18{128} loop domain: (iS15{8}, iS16{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}, iS14{128}, iS17{( ceilDiv(i2, 128) )}, iS18{128}) T2_s_float[iblockIdx.y11{8}, iS12{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}, iB8{128}, iblockIdx.x9{( ceilDiv(i2, 128) )}, iB10{128}] ca_pos( 2 ) logical domain: (iS4{i0}, iS5{i2}) allocation domain: (iblockIdx.y11{8}, iS12{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}, iB8{128}, iblockIdx.x9{( ceilDiv(i2, 128) )}, iB10{128}) contiguity: t t t t t Split: iS4{i0} by factor 128 -> iS7{( ceilDiv(i0, 128) )}, iB8{128} Outer split: iS7{( ceilDiv(i0, 128) )} by factor 8 -> iblockIdx.y11{8}, iS12{( ceilDiv(( ceilDiv(i0, 128) ), 8) )} Split: iS5{i2} by factor 128 -> iblockIdx.x9{( ceilDiv(i2, 128) )}, iB10{128} loop domain: (iblockIdx.y11{8}, iS12{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}, iB8{128}, iblockIdx.x9{( ceilDiv(i2, 128) )}, iB10{128}) T4_l_float[iblockIdx.y35{8}rf, rS36{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}rf, rS37{8}rf, ithreadIdx.y38{16}rf, iblockIdx.x39{( ceilDiv(i2, 128) )}, iS41{4}, ithreadIdx.x42{32}] ca_pos( 1 ) produce_pos( 2 ) root domain: (rS31{i0}rf, iS32{i2}) Split: rS31{i0}rf by factor 128 -> rS33{( ceilDiv(i0, 128) )}rf, rS34{128}rf Outer split: rS33{( ceilDiv(i0, 128) )}rf by factor 8 -> iblockIdx.y35{8}rf, rS36{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}rf Split: rS34{128}rf by factor 16 -> rS37{8}rf, ithreadIdx.y38{16}rf logical domain: (iblockIdx.y35{8}rf, rS36{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}rf, rS37{8}rf, ithreadIdx.y38{16}rf, iS32{i2}) contiguity: t n n t t Split: iS32{i2} by factor 128 -> iblockIdx.x39{( ceilDiv(i2, 128) )}, iS40{128} Split: rS31{i0}rf by factor 128 -> rS33{( ceilDiv(i0, 128) )}rf, rS34{128}rf Outer split: rS33{( ceilDiv(i0, 128) )}rf by factor 8 -> iblockIdx.y35{8}rf, rS36{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}rf Split: rS34{128}rf by factor 16 -> rS37{8}rf, ithreadIdx.y38{16}rf Split: iS40{128} by factor 32 -> iS41{4}, ithreadIdx.x42{32} loop domain: (iblockIdx.y35{8}rf, rS36{( ceilDiv(( ceilDiv(i0, 128) ), 8) )}rf, rS37{8}rf, ithreadIdx.y38{16}rf, iblockIdx.x39{( ceilDiv(i2, 128) )}, iS41{4}, ithreadIdx.x42{32}) T3_l_float[rblockIdx.y43{8}, rthreadIdx.y44{16}, iblockIdx.x46{( ceilDiv(i2, 128) )}, iG48{4}, ithreadIdx.x49{32}] produce_pos( 1 ) logical domain: (rblockIdx.y43{8}, rthreadIdx.y44{16}, iS45{i2}) contiguity: n n t Split: iS45{i2} by factor 128 -> iblockIdx.x46{( ceilDiv(i2, 128) )}, iS47{128} Split: iS47{128} by factor 32 -> iG48{4}, ithreadIdx.x49{32} loop domain: (rblockIdx.y43{8}, rthreadIdx.y44{16}, iblockIdx.x46{( ceilDiv(i2, 128) )}, iG48{4}, ithreadIdx.x49{32}) T1_g_float[iblockIdx.x25{( ceilDiv(i2, 128) )}, iS50{4}, ithreadIdx.x51{32}] ca_pos( 3 ) logical domain: (iS6{i2}) contiguity: t Split: iS6{i2} by factor 128 -> iblockIdx.x25{( ceilDiv(i2, 128) )}, iS26{128} Split: iS26{128} by factor 32 -> iS50{4}, ithreadIdx.x51{32} loop domain: (iblockIdx.x25{( ceilDiv(i2, 128) )}, iS50{4}, ithreadIdx.x51{32}) } // %kernelnon-TMA fusion_ir
Inputs: T0_g_float[iS66{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iS65{blockDim.x}, iS75{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}, iS67{1}, iS63{4}, iS74{gridDim.y}, iS69{blockDim.y}, iS73{1}, iS71{8}] Outputs: T1_g_float[iblockIdx.x80{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x79{blockDim.x}, iUS81{1}, iV77{4}] ca_pos( 3 ) produce_pos( 3 ) %kernel { T2_l_float[iblockIdx.x52{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x51{blockDim.x}, iS61{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}, iUS53{1}, iV49{4}, iblockIdx.y60{gridDim.y}, ithreadIdx.y55{blockDim.y}, iUS59{1}, iUR57{8}] ca_pos( 4 ) = Set( T0_g_float[iS66{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iS65{blockDim.x}, iS75{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}, iS67{1}, iS63{4}, iS74{gridDim.y}, iS69{blockDim.y}, iS73{1}, iS71{8}], cache_op=Streaming ) T4_l_float[iblockIdx.x29{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x28{blockDim.x}, rS38{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}rf, iUS30{1}, iS26{4}, iblockIdx.y37{gridDim.y}rf, ithreadIdx.y32{blockDim.y}rf, rUS36{1}rf, rS34{8}rf] ca_pos( 2 ) produce_pos( 4 ) = reduction( T2_l_float[iblockIdx.x52{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x51{blockDim.x}, iS61{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}, iUS53{1}, iV49{4}, iblockIdx.y60{gridDim.y}, ithreadIdx.y55{blockDim.y}, iUS59{1}, iUR57{8}] ca_pos( 4 ), op = add, initial value = float(0), allreduce = false ) T3_l_float[iblockIdx.x46{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x45{blockDim.x}, iUS47{1}, iG43{4}, rblockIdx.y39{gridDim.y}, rthreadIdx.y40{blockDim.y}] ca_pos( 3 ) produce_pos( 2 ) = reduction( T4_l_float[iblockIdx.x29{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x28{blockDim.x}, rS38{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}rf, iUS30{1}, iS26{4}, iblockIdx.y37{gridDim.y}rf, ithreadIdx.y32{blockDim.y}rf, rUS36{1}rf, rS34{8}rf] ca_pos( 2 ) produce_pos( 4 ), op = add, initial value = float(0), allreduce = false ) T1_g_float[iblockIdx.x80{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x79{blockDim.x}, iUS81{1}, iV77{4}] ca_pos( 3 ) produce_pos( 3 ) = Set( T3_l_float[iblockIdx.x46{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x45{blockDim.x}, iUS47{1}, iG43{4}, rblockIdx.y39{gridDim.y}, rthreadIdx.y40{blockDim.y}] ca_pos( 3 ) produce_pos( 2 ), cache_op=Streaming ) TransformPrinter : T0_g_float[iS66{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iS65{blockDim.x}, iS75{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}, iS67{1}, iS63{4}, iS74{gridDim.y}, iS69{blockDim.y}, iS73{1}, iS71{8}] logical domain: (iS0{i0}, iS1{i2}) contiguity: t t Split: iS1{i2} by factor 4 -> iS62{( ceilDiv(i2, 4) )}, iS63{4} Split: iS0{i0} by factor blockDim.y -> iS68{( ceilDiv(i0, blockDim.y) )}, iS69{blockDim.y} Split: iS62{( ceilDiv(i2, 4) )} by factor blockDim.x -> iS64{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iS65{blockDim.x} Split: iS68{( ceilDiv(i0, blockDim.y) )} by factor 8 -> iS70{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}, iS71{8} Split: iS64{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )} by factor 1 -> iS66{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iS67{1} Split: iS70{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )} by factor 1 -> iS72{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}, iS73{1} Outer split: iS72{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )} by factor gridDim.y -> iS74{gridDim.y}, iS75{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )} loop domain: (iS66{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iS65{blockDim.x}, iS75{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}, iS67{1}, iS63{4}, iS74{gridDim.y}, iS69{blockDim.y}, iS73{1}, iS71{8}) T2_l_float[iblockIdx.x52{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x51{blockDim.x}, iS61{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}, iUS53{1}, iV49{4}, iblockIdx.y60{gridDim.y}, ithreadIdx.y55{blockDim.y}, iUS59{1}, iUR57{8}] ca_pos( 4 ) logical domain: (iS6{i0}, iS7{i2}) contiguity: t t Split: iS7{i2} by factor 4 -> iS48{( ceilDiv(i2, 4) )}, iV49{4} Split: iS6{i0} by factor blockDim.y -> iS54{( ceilDiv(i0, blockDim.y) )}, ithreadIdx.y55{blockDim.y} Split: iS48{( ceilDiv(i2, 4) )} by factor blockDim.x -> iS50{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x51{blockDim.x} Split: iS54{( ceilDiv(i0, blockDim.y) )} by factor 8 -> iS56{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}, iUR57{8} Split: iS50{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )} by factor 1 -> iblockIdx.x52{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iUS53{1} Split: iS56{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )} by factor 1 -> iS58{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}, iUS59{1} Outer split: iS58{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )} by factor gridDim.y -> iblockIdx.y60{gridDim.y}, iS61{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )} loop domain: (iblockIdx.x52{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x51{blockDim.x}, iS61{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}, iUS53{1}, iV49{4}, iblockIdx.y60{gridDim.y}, ithreadIdx.y55{blockDim.y}, iUS59{1}, iUR57{8}) T4_l_float[iblockIdx.x29{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x28{blockDim.x}, rS38{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}rf, iUS30{1}, iS26{4}, iblockIdx.y37{gridDim.y}rf, ithreadIdx.y32{blockDim.y}rf, rUS36{1}rf, rS34{8}rf] ca_pos( 2 ) produce_pos( 4 ) root domain: (rS23{i0}rf, iS24{i2}) Split: rS23{i0}rf by factor blockDim.y -> rS31{( ceilDiv(i0, blockDim.y) )}rf, ithreadIdx.y32{blockDim.y}rf Split: rS31{( ceilDiv(i0, blockDim.y) )}rf by factor 8 -> rS33{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}rf, rS34{8}rf Split: rS33{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}rf by factor 1 -> rS35{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}rf, rUS36{1}rf Outer split: rS35{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}rf by factor gridDim.y -> iblockIdx.y37{gridDim.y}rf, rS38{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}rf logical domain: (iblockIdx.y37{gridDim.y}rf, rS38{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}rf, rUS36{1}rf, rS34{8}rf, ithreadIdx.y32{blockDim.y}rf, iS24{i2}) contiguity: t n n n t t Split: iS24{i2} by factor 4 -> iS25{( ceilDiv(i2, 4) )}, iS26{4} Split: rS23{i0}rf by factor blockDim.y -> rS31{( ceilDiv(i0, blockDim.y) )}rf, ithreadIdx.y32{blockDim.y}rf Split: iS25{( ceilDiv(i2, 4) )} by factor blockDim.x -> iS27{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x28{blockDim.x} Split: rS31{( ceilDiv(i0, blockDim.y) )}rf by factor 8 -> rS33{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}rf, rS34{8}rf Split: iS27{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )} by factor 1 -> iblockIdx.x29{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iUS30{1} Split: rS33{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}rf by factor 1 -> rS35{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}rf, rUS36{1}rf Outer split: rS35{( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) )}rf by factor gridDim.y -> iblockIdx.y37{gridDim.y}rf, rS38{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}rf loop domain: (iblockIdx.x29{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x28{blockDim.x}, rS38{( ceilDiv(( ceilDiv(( ceilDiv(i0, blockDim.y) ), 8) ), gridDim.y) )}rf, iUS30{1}, iS26{4}, iblockIdx.y37{gridDim.y}rf, ithreadIdx.y32{blockDim.y}rf, rUS36{1}rf, rS34{8}rf) T3_l_float[iblockIdx.x46{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x45{blockDim.x}, iUS47{1}, iG43{4}, rblockIdx.y39{gridDim.y}, rthreadIdx.y40{blockDim.y}] ca_pos( 3 ) produce_pos( 2 ) logical domain: (rblockIdx.y39{gridDim.y}, rthreadIdx.y40{blockDim.y}, iS41{i2}) contiguity: n n t Split: iS41{i2} by factor 4 -> iS42{( ceilDiv(i2, 4) )}, iG43{4} Split: iS42{( ceilDiv(i2, 4) )} by factor blockDim.x -> iS44{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x45{blockDim.x} Split: iS44{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )} by factor 1 -> iblockIdx.x46{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iUS47{1} loop domain: (iblockIdx.x46{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x45{blockDim.x}, iUS47{1}, iG43{4}, rblockIdx.y39{gridDim.y}, rthreadIdx.y40{blockDim.y}) T1_g_float[iblockIdx.x80{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x79{blockDim.x}, iUS81{1}, iV77{4}] ca_pos( 3 ) produce_pos( 3 ) logical domain: (iS8{i2}) contiguity: t Split: iS8{i2} by factor 4 -> iS76{( ceilDiv(i2, 4) )}, iV77{4} Split: iS76{( ceilDiv(i2, 4) )} by factor blockDim.x -> iS78{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x79{blockDim.x} Split: iS78{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )} by factor 1 -> iblockIdx.x80{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, iUS81{1} loop domain: (iblockIdx.x80{( ceilDiv(( ceilDiv(i2, 4) ), blockDim.x) )}, ithreadIdx.x79{blockDim.x}, iUS81{1}, iV77{4}) } // %kernel