Problem
RCCL (and other backends) configure key parameters via environment variables read at init_process_group time — e.g., NCCL_MIN_CTAS, NCCL_MAX_CTAS, RCCL_WARP_SPEED_CU_COUNT. These can't be changed mid-process, so they can't be swept as regular @bench.axis() values.
Currently the only workaround is a bash loop:
for ctas in 32 64 128 256; do
    NCCL_MIN_CTAS=$ctas NCCL_MAX_CTAS=$ctas python bench.py --format csv >> sweep.csv
done
This defeats the purpose of the declarative axis model.
Proposed API
import torch
import torch.distributed as dist

@bench.register
@bench.axis("num_ranks", [8])
@bench.axis("M", [1, 32, 128, 512, 2048, 8192])
@bench.env("NCCL_MIN_CTAS", [32, 64, 128, 256])
@bench.env("NCCL_MAX_CTAS", [32, 64, 128, 256])
def rccl_cta_sweep(state, ctx):
    M = state["M"]
    # Each rank fills its tensor with (rank + 1) so the reduced result is easy to check.
    tensor = torch.full((M, 2880), float(ctx.get_rank() + 1), dtype=torch.bfloat16,
                        device=f"cuda:{ctx.get_rank()}")
    # bf16 payload (2 bytes per element) scaled by the 2 * (n - 1) / n all-reduce traffic factor.
    state.set_bytes(M * 2880 * 2 * 2 * (ctx.get_num_ranks() - 1) / ctx.get_num_ranks())
    state.exec(lambda: dist.all_reduce(tensor, op=dist.ReduceOp.SUM))
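To make the declaration side concrete, here is a minimal sketch of how such a decorator could record its sweep on the benchmark function. The standalone `env` function and the `_env_sweeps` attribute are hypothetical names chosen for illustration; they are not part of the existing bench module.

```python
def env(name, values):
    """Hypothetical stand-in for the proposed @bench.env decorator."""
    def decorator(fn):
        # Decorators apply bottom-up, so put the new entry first to keep
        # the sweeps in the order they appear in the source.
        existing = dict(getattr(fn, "_env_sweeps", {}))
        # Environment values are always strings, so normalize up front.
        fn._env_sweeps = {name: [str(v) for v in values], **existing}
        return fn
    return decorator


@env("NCCL_MIN_CTAS", [32, 64])
@env("NCCL_MAX_CTAS", [32, 64])
def rccl_cta_sweep(state, ctx):
    pass


declared = rccl_cta_sweep._env_sweeps
```

Storing the sweep as plain metadata keeps the decorator free of side effects; the runner can read it later when deciding how to launch workers.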
Behavior
- @bench.env(name, values) declares an environment variable sweep
- Like num_ranks, each unique env var combination triggers a process respawn (new elastic_launch)
- Env vars are set in the worker processes before init_process_group
- Multiple @bench.env() decorators produce a cartesian product of env combinations
- Env values appear as columns in the output table
- Env vars are restored/unset after the benchmark completes
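The cartesian-product and restore/unset behavior listed above can be sketched with the standard library alone. The helper names `expand_env_sweeps` and `scoped_env` are assumptions made for this example, not existing runner functions.

```python
import itertools
import os
from contextlib import contextmanager


def expand_env_sweeps(env_sweeps):
    """Return one {name: value} dict per point in the cartesian product."""
    names = list(env_sweeps)
    value_lists = [env_sweeps[name] for name in names]
    return [
        dict(zip(names, (str(v) for v in combo)))
        for combo in itertools.product(*value_lists)
    ]


@contextmanager
def scoped_env(overrides):
    """Set env vars for the duration of the block, then restore or unset them."""
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)  # variable was not set before
            else:
                os.environ[name] = old  # put the previous value back


combos = expand_env_sweeps({
    "NCCL_MIN_CTAS": [32, 64],
    "NCCL_MAX_CTAS": [32, 64],
})
```

Note that a full cross of NCCL_MIN_CTAS and NCCL_MAX_CTAS includes points where the minimum exceeds the maximum, which is one reason a paired mode may be worth having.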
Implementation notes
- The runner already respawns processes for num_ranks changes. Env var changes would use the same mechanism: group axis combinations by (num_ranks, env_vars) and respawn when either changes.
- Env vars should be set via os.environ in the worker process before any torch/RCCL initialization.
- Consider coalescing: if two @bench.env() decorators have the same number of values and should be paired (not crossed), support a zip mode: @bench.env({"NCCL_MIN_CTAS": 64, "NCCL_MAX_CTAS": 64}) or similar.
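The grouping and zip ideas in these notes could look roughly like the sketch below. It assumes each benchmark point is a plain dict holding num_ranks, the axis values, and the env values; `group_by_spawn_key` and `zip_env` are hypothetical helpers, not existing runner code.

```python
def group_by_spawn_key(points, env_names):
    """Group points so that each group can share a single process launch."""
    groups = {}
    for point in points:
        # Sort the env items so the key does not depend on declaration order.
        env_key = tuple(sorted((name, str(point[name])) for name in env_names))
        key = (point["num_ranks"], env_key)
        groups.setdefault(key, []).append(point)
    return groups


def zip_env(paired):
    """Zip mode: pair env values positionally instead of crossing them."""
    lengths = {len(values) for values in paired.values()}
    if len(lengths) > 1:
        raise ValueError("zipped env sweeps must all have the same length")
    names = list(paired)
    return [
        dict(zip(names, (str(v) for v in combo)))
        for combo in zip(*paired.values())
    ]


points = [
    {"num_ranks": 8, "M": m, "NCCL_MIN_CTAS": ctas}
    for ctas in (32, 64)
    for m in (1, 32, 128)
]
groups = group_by_spawn_key(points, ["NCCL_MIN_CTAS"])
pairs = zip_env({"NCCL_MIN_CTAS": [32, 64], "NCCL_MAX_CTAS": [32, 64]})
```

With this grouping, the three M values for each CTA setting run inside one launch, so the respawn cost is paid once per (num_ranks, env) combination rather than once per point.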
Use cases
- Sweep RCCL CTA count alongside iris comm_sms for apples-to-apples CU comparison
- Enable/disable RCCL Warp Speed mode (RCCL_WARP_SPEED_ENABLE)
- Set RCCL_WARP_SPEED_CU_COUNT for direct CU control
- Any backend configured via environment (HSA, HIP, ROCm runtime flags)