iris.bench: add @bench.env() for environment variable sweeps #488

## Problem

RCCL (and other backends) read key parameters from environment variables at `init_process_group` time, e.g. `NCCL_MIN_CTAS`, `NCCL_MAX_CTAS`, `RCCL_WARP_SPEED_CU_COUNT`. Because these can't be changed mid-process, they can't be swept as regular `@bench.axis()` values.

Currently the only workaround is a bash loop:
```bash
for ctas in 32 64 128 256; do
  NCCL_MIN_CTAS=$ctas NCCL_MAX_CTAS=$ctas python bench.py --format csv >> sweep.csv
done
```

This moves the sweep out of the benchmark definition and into shell scripting, which defeats the purpose of the declarative axis model.

## Proposed API

```python
import torch
import torch.distributed as dist

@bench.register
@bench.axis("num_ranks", [8])
@bench.axis("M", [1, 32, 128, 512, 2048, 8192])
# Two crossed env sweeps: 4 x 4 = 16 env combinations, each run after a respawn.
@bench.env("NCCL_MIN_CTAS", [32, 64, 128, 256])
@bench.env("NCCL_MAX_CTAS", [32, 64, 128, 256])
def rccl_cta_sweep(state, ctx):
    M = state["M"]
    rank, world = ctx.get_rank(), ctx.get_num_ranks()
    tensor = torch.full((M, 2880), float(rank + 1), dtype=torch.bfloat16,
                        device=f"cuda:{rank}")
    # bf16 payload (2 bytes/element) times the all-reduce factor 2 * (world - 1) / world.
    state.set_bytes(M * 2880 * 2 * 2 * (world - 1) / world)
    state.exec(lambda: dist.all_reduce(tensor, op=dist.ReduceOp.SUM))
```

## Behavior

- `@bench.env(name, values)` declares an environment variable sweep
- Like `num_ranks`, each unique env var combination triggers a **process respawn** (new `elastic_launch`)
- Env vars are set in the worker processes before `init_process_group`
- Multiple `@bench.env()` decorators produce a cartesian product of env combinations
- Env values appear as columns in the output table
- Env vars are restored/unset after the benchmark completes
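
The cartesian-product and restore/unset semantics above could be sketched roughly as follows. This is an illustration only, not iris code: `expand_env_sweeps` and `scoped_env` are hypothetical helper names, and values are converted to strings on the assumption that they are written to `os.environ` unchanged.

```python
import itertools
import os
from contextlib import contextmanager


def expand_env_sweeps(sweeps):
    """Cross every declared env sweep; `sweeps` maps name -> list of values."""
    names = list(sweeps)
    return [
        dict(zip(names, (str(v) for v in combo)))
        for combo in itertools.product(*(sweeps[n] for n in names))
    ]


@contextmanager
def scoped_env(env):
    """Set env vars for the duration of a block, then restore or unset them."""
    saved = {name: os.environ.get(name) for name in env}
    os.environ.update(env)
    try:
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)  # was unset before: unset again
            else:
                os.environ[name] = old  # was set before: restore old value


# Hypothetical usage mirroring the two decorators in the proposed API.
combos = expand_env_sweeps({
    "NCCL_MIN_CTAS": [32, 64, 128, 256],
    "NCCL_MAX_CTAS": [32, 64, 128, 256],
})
```

Here `combos` holds 16 dictionaries, one per respawn. A real runner would apply each one in the worker before `init_process_group`; `scoped_env` only shows how the parent's environment could be left untouched afterwards.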

## Implementation notes

- The runner already respawns processes for `num_ranks` changes. Env var changes would use the same mechanism — group axis combinations by `(num_ranks, env_vars)` and respawn when either changes.
- Env vars should be set via `os.environ` in the worker process before any torch/RCCL initialization.
- Consider a paired (`zip`) mode: when two env vars have the same number of values and should vary together rather than be crossed (as `NCCL_MIN_CTAS` and `NCCL_MAX_CTAS` do in the bash loop above), accept them in a single decorator, e.g. `@bench.env({"NCCL_MIN_CTAS": 64, "NCCL_MAX_CTAS": 64})` or similar.
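
One way the grouping and the paired mode from these notes might look, as a sketch under stated assumptions: sweep points are plain dictionaries, env values are already strings, and `respawn_key`, `group_by_respawn`, and `zip_env` are hypothetical names rather than existing runner functions.

```python
from itertools import groupby


def respawn_key(point, env_names):
    """Values that can only change across a process respawn."""
    return (point["num_ranks"], tuple((n, point[n]) for n in env_names))


def group_by_respawn(points, env_names):
    """Group sweep points so that each group can run inside a single launch."""
    def key(point):
        return respawn_key(point, env_names)

    # sorted() is stable, so the in-process axis order is kept within a group.
    ordered = sorted(points, key=key)
    return [(k, list(g)) for k, g in groupby(ordered, key=key)]


def zip_env(paired):
    """Pair equal-length env sweeps element-wise instead of crossing them."""
    lengths = {len(values) for values in paired.values()}
    if len(lengths) != 1:
        raise ValueError("paired env sweeps must have the same length")
    names = list(paired)
    return [
        dict(zip(names, (str(v) for v in values)))
        for values in zip(*paired.values())
    ]
```

With paired `NCCL_MIN_CTAS` and `NCCL_MAX_CTAS` sweeps of four values each, `zip_env` yields four environments instead of the sixteen a cartesian product would produce, which matches what the bash loop does today.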

## Use cases

- Sweep RCCL CTA count alongside iris `comm_sms` for apples-to-apples CU comparison
- Enable/disable RCCL Warp Speed mode (`RCCL_WARP_SPEED_ENABLE`)
- Set `RCCL_WARP_SPEED_CU_COUNT` for direct CU control
- Any backend configured via environment (HSA, HIP, ROCm runtime flags)
