Benchmarking code in Iris is currently scattered across benchmark/ and examples/, with each script re-implementing the same logic (warmup loops, synchronization, timing, averaging, printing). Over time this has led to copy-pasted code, inconsistent measurement patterns, and benchmarks that are hard to reuse or automate.
It would be useful to introduce a small, shared benchmarking harness (e.g. iris.bench) that standardizes:
- warmup and iteration handling
- timing and synchronization
- basic statistics (mean / p50 / p99)
- parameter sweeps
- structured result output (e.g. JSON or dict)
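For the statistics piece, a minimal sketch of what the harness might compute from raw timing samples (the `summarize` helper and its field names are hypothetical, not an existing Iris API):

```python
import statistics

def summarize(samples_ms):
    """Hypothetical stats helper: reduce a list of per-iteration
    timings (in ms) to the mean / p50 / p99 summary the harness
    would report."""
    s = sorted(samples_ms)
    return {
        "mean": statistics.mean(s),
        "p50": s[len(s) // 2],
        # nearest-rank p99; clamp the index for small sample counts
        "p99": s[min(len(s) - 1, round(0.99 * (len(s) - 1)))],
    }

print(summarize([1.0, 2.0, 3.0, 100.0]))
```

Returning a plain dict keeps the result easy to serialize to JSON for the structured-output case.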
This would allow both examples/ and benchmark/ to share the same timing infrastructure, while keeping example code focused on semantics rather than measurement boilerplate.
Example (sketch):

```python
from iris.bench import benchmark

@benchmark(name="gemm_all_scatter", warmup=5, iters=50)
def run(size, world_size):
    # setup tensors
    # launch Iris kernel
    kernel(...)
```
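The decorator itself could be implemented along these lines (a minimal CPU-timing sketch: warmup and iteration handling plus summary stats, with no device synchronization; `iris.bench` does not exist yet, so all names here are proposals):

```python
import time
import statistics
from functools import wraps

def benchmark(name, warmup=5, iters=50):
    """Proposed decorator: run `warmup` untimed calls, then time
    `iters` calls and return summary statistics as a dict."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for _ in range(warmup):      # untimed warmup runs
                fn(*args, **kwargs)
            samples = []
            for _ in range(iters):       # timed iterations
                t0 = time.perf_counter()
                fn(*args, **kwargs)
                samples.append((time.perf_counter() - t0) * 1e3)  # ms
            samples.sort()
            return {
                "name": name,
                "iters": iters,
                "mean_ms": statistics.mean(samples),
                "p50_ms": samples[len(samples) // 2],
                "p99_ms": samples[min(len(samples) - 1,
                                      int(len(samples) * 0.99))],
            }
        return wrapper
    return decorator
```

A real implementation would insert the appropriate GPU synchronization (or delegate to existing Iris timing utilities) around the timed region; the structure above only illustrates the warmup/iteration/stats flow.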
Internally, this could reuse the existing `do_bench`-style timing utilities and any other measurement code we already have in Iris. Such a harness would significantly reduce duplicated code, improve maintainability, and make it easier to add consistent benchmarks and eventually integrate CI performance tracking.
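For the parameter-sweep and structured-output points, a sketch of how a sweep helper might drive a benchmark over a grid and emit JSON (the `sweep` helper and its signature are hypothetical):

```python
import itertools
import json
import time

def sweep(fn, grid):
    """Hypothetical helper: call `fn` once for every combination of
    parameters in `grid` (a dict mapping name -> list of values) and
    collect per-run timings as a list of dicts."""
    results = []
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        t0 = time.perf_counter()
        fn(**params)
        elapsed_ms = (time.perf_counter() - t0) * 1e3
        results.append({**params, "time_ms": elapsed_ms})
    return results

# Example: sweep a dummy workload over sizes and world sizes, then
# emit the results as JSON for downstream tooling or CI tracking.
rows = sweep(lambda size, world_size: None,
             {"size": [1024, 2048], "world_size": [2, 4]})
print(json.dumps(rows, indent=2))
```

Emitting one flat dict per run keeps the output trivially consumable by JSON tooling, pandas, or a CI dashboard.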