81 commits
c41742c
add geak benchmark and repo tasks
Mar 9, 2026
3b49279
modify task_runner
Mar 9, 2026
ba8cb08
change rocprim task_runner.py
Mar 9, 2026
87b2508
change rocprim task_runner.py
Mar 12, 2026
6de72f6
modify geak_benchmark parallel
Mar 12, 2026
1db99a0
modify geak_benchmark parallel
Mar 12, 2026
5b25da6
add task configs
Mar 12, 2026
76362bb
add task configs
Mar 12, 2026
df491fa
add readme
Mar 13, 2026
2ca9a6d
Update README.md
yueliu14 Mar 13, 2026
a8674b1
Update README.md
yueliu14 Mar 17, 2026
b00392e
change agent name from geak_benchamrk to geak_v3
Mar 18, 2026
0bdb055
change name geak_benchmark to geak_v3
Mar 18, 2026
a1e872a
Merge main into geak_benchmark: add repo_url support
Mar 18, 2026
c06d30c
Add repo_url support for rocprim tasks
Mar 18, 2026
7621596
change to geak_v3
Mar 18, 2026
7555ff3
GEAK Triton benchmark: geak_v3_triton agent, eval tasks, and fixes
iraj465 Mar 19, 2026
06dbe42
Unify Triton launcher to use geak CLI instead of raw Python modules
iraj465 Mar 19, 2026
739466c
Unify both HIP and Triton launchers to use geak --eval
iraj465 Mar 19, 2026
497b027
Add README for geak_v3_triton agent
iraj465 Mar 19, 2026
0c3660e
Fix patch application path for unified geak CLI output
iraj465 Mar 24, 2026
e6f3829
fix: use verified_speedup from FULL_BENCHMARK instead of select_agent…
iraj465 Mar 25, 2026
68774df
fix: patch apply with variable strip levels and recursive worktree se…
iraj465 Mar 26, 2026
8b92cdf
fix: fast_rms_layernorm harness creates tensor on CPU then moves to CUDA
iraj465 Mar 26, 2026
88be18d
feat: read GEAK final_report.json for deterministic speedup reporting
iraj465 Mar 26, 2026
5c076bb
feat: include per-round GEAK details in task_result.yaml
iraj465 Mar 26, 2026
cbbc086
fix: add @triton.jit to pid_grid and remove unused EVEN_M_N heuristic
iraj465 Mar 26, 2026
658ec8c
fix: use max of benchmark and verified speedup in AKA evaluator
iraj465 Mar 26, 2026
a549cff
fix: use verified_speedup as canonical, revert max() fallback
iraj465 Mar 26, 2026
d78686a
feat: add 8 new GEAK Triton kernel tasks + aiter version pinning
iraj465 Mar 26, 2026
9a46a0c
feat: add mini_swe_triton agent for apple-to-apple comparison with GEAK
iraj465 Mar 26, 2026
dc89e58
feat: tune mini_swe_triton agent config for lightweight single-round …
iraj465 Mar 26, 2026
ccb31dd
refactor: mini_swe_triton as raw agent without GEAK preprocessing
iraj465 Mar 26, 2026
56083b7
fix: add missing os import in main.py
iraj465 Mar 26, 2026
c3de798
fix: detect container-internal execution for aiter checkout
iraj465 Mar 26, 2026
a222231
fix: always reset+clean before aiter checkout to avoid stale file con…
iraj465 Mar 26, 2026
8e57f13
fix: always reset+clean before aiter checkout to avoid stale file con…
iraj465 Mar 26, 2026
324444d
fix: run aiter-dependent tasks locally instead of via docker exec
iraj465 Mar 26, 2026
c3faf15
fix: flatten aiter and op_tests import paths in 8 new kernel harnesses
iraj465 Mar 26, 2026
98a64a9
feat: add dynamic kernel.py loader to new kernel harnesses
iraj465 Mar 26, 2026
4b0d1a5
docs: update README with exact setup/run commands for all 16 kernels
iraj465 Mar 27, 2026
f396f98
feat: update mini_swe_triton config with 8 new kernels
iraj465 Mar 27, 2026
908a204
feat: add nsa_forward kernel task
iraj465 Mar 27, 2026
4dff4c3
feat: add ff_backward and mla_prefill_reduce kernel tasks
iraj465 Mar 27, 2026
58d7b7f
fix: bypass ASM prefill in mla_prefill_reduce harness
iraj465 Mar 27, 2026
0266a8e
fix: rewrite ff_backward harness to use kernel.py directly
iraj465 Mar 27, 2026
9efbbd5
feat: add target_kernel_functions and prompt.instructions to all 19 k…
iraj465 Mar 27, 2026
ec5846b
fix: list ALL @triton.jit functions in target_kernel_functions
iraj465 Mar 27, 2026
e28f867
fix: list only primary kernel functions, exclude helpers
iraj465 Mar 27, 2026
2cb48e0
feat: add 4 refk kernels (identity, fp8_blockwise_mm, mla_decode, moe)
iraj465 Mar 27, 2026
bcf5e2c
track kernel.py for patch generation
iraj465 Mar 28, 2026
14ab798
Add kernel.py for patch tracking
iraj465 Mar 29, 2026
135bc28
Track kernel.py for fused_moe_mxfp4 optimization
iraj465 Mar 29, 2026
e223c35
Add kernel.py for optimization
iraj465 Mar 29, 2026
0101cbd
Track kernel.py for MLA decode optimization
iraj465 Mar 29, 2026
d6c495f
Add kernel.py baseline for GEMM optimization
iraj465 Mar 29, 2026
e703d46
Add sitecustomize.py for namespace stub neutralization
iraj465 Mar 29, 2026
286ab33
Stop tracking workspace dirs, update task prompts
iraj465 Mar 31, 2026
4d446c5
Extend .gitignore to cover all runtime artifacts
iraj465 Mar 31, 2026
f85813a
Use all shapes for task-local benchmark to match verified benchmark
iraj465 Mar 31, 2026
8395782
Make --benchmark use all configs across all remaining harness files
iraj465 Mar 31, 2026
dd11ac5
Add GEAK batch configs and update README with run instructions
iraj465 Apr 1, 2026
11f2afe
fix: add --iterations CLI arg to all harnesses missing it
iraj465 Apr 1, 2026
091fc59
Standardize geak_v3_triton agent to use unified geak CLI
iraj465 Apr 6, 2026
b23d9be
fix: use best benchmark_speedup across all rounds, not just final round
iraj465 Apr 7, 2026
b821fc0
fix: parse total_speedup/best_speedup strings from final_report.json
iraj465 Apr 7, 2026
eb09002
fix: move --iterations out of mutually_exclusive_group in 6 harnesses
iraj465 Apr 7, 2026
39c4656
config: restore GEAK_MAX_ROUNDS=5 as default (override via env for ex…
iraj465 Apr 7, 2026
30c7108
docs: update README and canonical slot configs for all 18 kernels
iraj465 Apr 7, 2026
82ac244
fix: restore standard AKA evaluation flow for all agents
iraj465 Apr 7, 2026
ce19137
fix: standardize compile_command quoting in 4 refk kernel configs
iraj465 Apr 7, 2026
9e04f17
fix: add missing _n_all definition in 3 harnesses
iraj465 Apr 8, 2026
23faa91
fix: address reviewer feedback on evaluation flow and task conventions
iraj465 Apr 9, 2026
1119c9b
config: clean up geak_v3_triton defaults for benchmarking
iraj465 Apr 9, 2026
5a1075b
config: remove GEAK_EXCLUDED_AGENTS, let GEAK handle its own defaults
iraj465 Apr 9, 2026
38a4a1d
fix: use all configs for correctness checks in 10 harnesses
iraj465 Apr 9, 2026
345e71d
fix: rank worktree kernels by speedup in last-resort patch promotion
iraj465 Apr 9, 2026
2875848
fix: pin GPU for post-agent evaluation to match baseline
iraj465 Apr 9, 2026
55eccba
fix: use max(verified, benchmark) speedup for round selection
iraj465 Apr 9, 2026
7f4e9ed
remove 5 non-benchmark kernels, keep 18 target kernels
iraj465 Apr 10, 2026
fc0997d
fix: improve patch application order and add change verification
iraj465 Apr 15, 2026
10 changes: 9 additions & 1 deletion .gitignore
@@ -14,7 +14,15 @@ agents/geak_optimagentv2/.geak_setup_complete
 agents/geak_ourllm_kernel2kernel/GEAK-agent
 run.sh
 config_*.yaml
+!config_geak_triton_mem_*.yaml
 tmp*
 kill.sh
 saved_results
-.mcp.json
+.mcp.json
+
+*workspace*
+ws_mem*
+do_task.sh
+traj.json
+**/baseline_metrics.json
+**/profile.json
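
Review note: the new `!config_geak_triton_mem_*.yaml` line only works because it comes after `config_*.yaml`; in a `.gitignore`, the last matching pattern decides. The sketch below is a deliberately simplified model of that rule (basename patterns only, matched with `fnmatch`; it is not Git's full matcher, and `is_ignored` is a hypothetical helper).

```python
from fnmatch import fnmatchcase

# A subset of the patterns from this hunk, in file order.
PATTERNS = [
    "config_*.yaml",
    "!config_geak_triton_mem_*.yaml",
    "ws_mem*",
    "traj.json",
]


def is_ignored(name, patterns=PATTERNS):
    """Simplified model: the last pattern that matches the basename wins."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(name, glob):
            # A '!' pattern re-includes the file, any other pattern ignores it.
            ignored = not negated
    return ignored
```

Under this model the batch configs referenced in the README stay tracked while other `config_*.yaml` files remain ignored.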
79 changes: 79 additions & 0 deletions README.md
@@ -318,6 +318,85 @@ Review the generated `validation_report.yaml` in the workspace directory. The ta
See [agents/task_validator/README.md](agents/task_validator/README.md) for the full list of validation checks and requirements.


## GEAK Triton Kernel Optimization Runs

These runs batch-optimize Triton kernels across multiple GPUs using the GEAK agent with the heterogeneous memory configuration and a model ensemble.

All runs use: `GEAK_CONFIG_NAME=heterogeneous_memory_on`

### Batch 1 Configs & Commands

**Slot 1 — GPUs 0-3** (`config_geak_triton_mem_slot1_rerun.yaml`):
- `triton2triton/geak_eval/L1/refk_fp8_blockwise_mm`
- `triton2triton/geak_eval/L1/moe_routing_sigmoid_top1`
- `triton2triton/geak_eval/L1/llama_ff_triton`
- `triton2triton/geak_eval/L1/refk_identity`

```bash
GEAK_CONFIG_NAME=heterogeneous_memory_on GEAK_GPU_IDS="0,1,2,3" \
python3 main.py --config_name config_geak_triton_mem_slot1_rerun.yaml \
> /tmp/slot1_run.log 2>&1 &
```

**Slot 2 — GPUs 4-7** (`config_geak_triton_mem_slot2_rerun.yaml`):
- `triton2triton/geak_eval/L2/topk`
- `triton2triton/geak_eval/L2/lean_atten_paged`
- `triton2triton/geak_eval/L2/fast_rms_layernorm`
- `triton2triton/geak_eval/L1/mla_decode`

```bash
GEAK_CONFIG_NAME=heterogeneous_memory_on GEAK_GPU_IDS="4,5,6,7" \
python3 main.py --config_name config_geak_triton_mem_slot2_rerun.yaml \
> /tmp/slot2_run.log 2>&1 &
```

### Batch 2 Configs & Commands

**Slot 1 — GPUs 0-3** (`config_geak_triton_mem_slot1_batch2.yaml`):
- `triton2triton/geak_eval/L1/fused_append_shared_experts`
- `triton2triton/geak_eval/L2/ff_backward`
- `triton2triton/geak_eval/L3/gemm_a16w16_atomic`
- `triton2triton/geak_eval/L3/fused_qkv_rope`
- `triton2triton/geak_eval/L3/fused_mxfp4_quant_moe_sort`

```bash
GEAK_CONFIG_NAME=heterogeneous_memory_on GEAK_GPU_IDS="0,1,2,3" \
python3 main.py --config_name config_geak_triton_mem_slot1_batch2.yaml \
> /tmp/slot1_b2_run.log 2>&1 &
```

**Slot 2 — GPUs 4-7** (`config_geak_triton_mem_slot2_batch2.yaml`):
- `triton2triton/geak_eval/L3/gemm`
- `triton2triton/geak_eval/L3/gemm_a16wfp4`
- `triton2triton/geak_eval/L3/fused_moe_mxfp4`
- `triton2triton/geak_eval/L3/fused_qk_rope_cache_mla`
- `triton2triton/geak_eval/L3/fused_rms_fp8`

```bash
GEAK_CONFIG_NAME=heterogeneous_memory_on GEAK_GPU_IDS="4,5,6,7" \
python3 main.py --config_name config_geak_triton_mem_slot2_batch2.yaml \
> /tmp/slot2_b2_run.log 2>&1 &
```
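
The four launch commands above differ only in the config file, the GPU list, and the log path. As a minimal sketch, assuming you prefer to drive them from Python, the same invocations could be assembled as follows. `RUNS`, `build_launch`, and `launch` are hypothetical names, not part of this repository; the environment variables and the `--config_name` flag are taken from the commands above.

```python
import os
import subprocess

# (config file, GPU ids, log path) for each slot, copied from the commands above.
# Batch 2 reuses the GPUs of batch 1, so this sketch assumes it is started
# only after batch 1 has finished.
RUNS = [
    ("config_geak_triton_mem_slot1_rerun.yaml", "0,1,2,3", "/tmp/slot1_run.log"),
    ("config_geak_triton_mem_slot2_rerun.yaml", "4,5,6,7", "/tmp/slot2_run.log"),
    ("config_geak_triton_mem_slot1_batch2.yaml", "0,1,2,3", "/tmp/slot1_b2_run.log"),
    ("config_geak_triton_mem_slot2_batch2.yaml", "4,5,6,7", "/tmp/slot2_b2_run.log"),
]


def build_launch(config_name, gpu_ids, base_env=None):
    """Return the (argv, env) pair equivalent to one shell command above."""
    env = dict(os.environ if base_env is None else base_env)
    env["GEAK_CONFIG_NAME"] = "heterogeneous_memory_on"
    env["GEAK_GPU_IDS"] = gpu_ids
    cmd = ["python3", "main.py", "--config_name", config_name]
    return cmd, env


def launch(config_name, gpu_ids, log_path):
    """Start one slot in the background, sending stdout and stderr to log_path."""
    cmd, env = build_launch(config_name, gpu_ids)
    with open(log_path, "w") as log:
        # The child keeps its own copy of the log file descriptor.
        return subprocess.Popen(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
```

For example, `procs = [launch(*run) for run in RUNS[:2]]` would start both batch 1 slots, and calling `wait()` on each process before launching `RUNS[2:]` keeps the batches from competing for the same GPUs.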

### Monitoring

```bash
# Check processes
ps aux | grep "main.py" | grep -v grep

# Tail logs (batch 1)
tail -20 /tmp/slot1_run.log
tail -20 /tmp/slot2_run.log

# Tail logs (batch 2)
tail -20 /tmp/slot1_b2_run.log
tail -20 /tmp/slot2_b2_run.log

# Check completed results
find ws_mem*/ -name "geak_summary.json" -exec echo "=== {} ===" \; -exec cat {} \;
```
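
If you want the completed results as data rather than terminal output, the `find` command above can be mirrored in Python. This is a sketch under the assumption that each `geak_summary.json` is a regular JSON document; `collect_summaries` is a hypothetical helper, and the schema of the summary files is not assumed.

```python
import json
from pathlib import Path


def collect_summaries(root="."):
    """Map each geak_summary.json found under ws_mem*/ to its parsed content."""
    results = {}
    for workspace in sorted(Path(root).glob("ws_mem*")):
        if not workspace.is_dir():
            continue
        for path in sorted(workspace.rglob("geak_summary.json")):
            try:
                results[str(path)] = json.loads(path.read_text())
            except json.JSONDecodeError:
                # A run that is still writing its summary may leave partial JSON.
                results[str(path)] = None
    return results
```

Entries that map to `None` are files that exist but could not be parsed, which usually deserves a second look at the corresponding log.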


## Next Steps

- Enhance A/B Testing with Better Interactivity and User Experience
19 changes: 15 additions & 4 deletions agents/SWE_agent/launch_agent.py
@@ -97,10 +97,21 @@ def launch_agent(eval_config: dict[str, Any], task_config_dir: str, workspace: s
     # copy the script python_bindings/tritonbench.py into the workspace
     shutil.copy(tritonbench_script_path, os.path.join(workspace, "python_bindings", "tritonbench.py"))
     if any("rocprim" in task for task in eval_config["tasks"]):
-        subprocess.run(
-            ["git", "clone", "https://github.com/ROCm/rocPRIM.git", os.path.join(workspace, "rocPRIM")],
-            check=True
-        )
+        for task in eval_config["tasks"]:
+            if "rocprim" not in task:
+                continue
+            repo_dir = Path(workspace) / "tasks" / task / "rocPRIM"
+            if (repo_dir / ".git").exists():
+                logger.info(f"Repository already exists at {repo_dir}, skipping clone")
+                continue
+            if repo_dir.exists():
+                logger.info(f"Repository directory already exists at {repo_dir}, skipping clone")
+                continue
+            repo_dir.parent.mkdir(parents=True, exist_ok=True)
+            subprocess.run(
+                ["git", "clone", "https://github.com/ROCm/rocPRIM.git", str(repo_dir)],
+                check=True,
+            )
     test_correctness_benchmark_path = Path(task_config_dir).parent / "python_bindings" / "test_correctness_benchmark.py"
     # make a dir for the target path
     os.makedirs(os.path.join(workspace, "python_bindings"), exist_ok=True)
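
Review note: both existence checks in the new loop end in `continue`, so the per-task logic reduces to "clone only if the target directory is absent". A minimal sketch of that idea as a standalone helper, assuming the same target layout; `ensure_clone` and its injectable `run` parameter are hypothetical and not part of this PR.

```python
import subprocess
from pathlib import Path

ROCPRIM_URL = "https://github.com/ROCm/rocPRIM.git"


def ensure_clone(repo_dir, url=ROCPRIM_URL, run=subprocess.run):
    """Clone url into repo_dir unless the directory already exists.

    Returns True if a clone was started, False if it was skipped.
    """
    repo_dir = Path(repo_dir)
    if repo_dir.exists():
        # Covers both a finished checkout (.git present) and a bare directory.
        return False
    repo_dir.parent.mkdir(parents=True, exist_ok=True)
    run(["git", "clone", url, str(repo_dir)], check=True)
    return True
```

Passing `run` as a parameter keeps the helper testable without network access; production code would leave it at the default `subprocess.run`.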
88 changes: 88 additions & 0 deletions agents/geak_v3/README.md
@@ -0,0 +1,88 @@
## `GEAK-V3`

This agent template integrates **GEAK v3** into AgentKernelArena so you can run AgentKernelArena tasks with GEAK v3 as the optimizing agent.

### 1) Install GEAK

GEAK provides the `geak` CLI. Install it in your Python environment:

```bash
cd /path/to/GEAK
pip install -e .
```

### 2) Configure AMD LLM environment variables

```bash
export AMD_LLM_API_KEY="your-key-here"
```

### 3) Configure the GEAK runner in geak_v3

Edit `agents/geak_v3/agent_config.yaml`.

Key fields:
- **`run.cmd`**: the executable to run (here, `geak`)
- **`run.configs`**: CLI options passed to that executable

Example:

```yaml
run:
  cmd: geak
  configs: "-c geak.yaml --yolo --num-parallel=2 --gpu-ids=0,1"
```

Notes:
- `-c geak.yaml` points to `agents/geak_v3/geak.yaml` (the launcher automatically resolves it to an absolute path).
- `--num-parallel` and `--gpu-ids` control **parallel sub-agents inside a single task** (multi-GPU). They do *not* change how AgentKernelArena schedules tasks (see “Important: tasks run serially” below).
- If you want to use a different `agent_config.yaml` without editing the repo, set:

```bash
export GEAK_AGENT_CONFIG="/abs/path/to/agent_config.yaml"
```
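
The notes above describe two resolution steps: picking the `agent_config.yaml` (with `GEAK_AGENT_CONFIG` as an override) and turning the relative `-c geak.yaml` into an absolute path. The following is only a sketch of how such logic could look, not the launcher's actual code; `agent_config_path` and `resolve_configs` are hypothetical names.

```python
import os
import shlex
from pathlib import Path


def agent_config_path(agent_dir, environ=None):
    """Prefer $GEAK_AGENT_CONFIG, else fall back to <agent_dir>/agent_config.yaml."""
    environ = os.environ if environ is None else environ
    override = environ.get("GEAK_AGENT_CONFIG")
    return Path(override) if override else Path(agent_dir) / "agent_config.yaml"


def resolve_configs(configs, agent_dir):
    """Split the run.configs string and make a relative `-c` path absolute."""
    args = shlex.split(configs)
    for i in range(len(args) - 1):
        if args[i] == "-c" and not os.path.isabs(args[i + 1]):
            # Interpret the config file relative to the agent directory.
            args[i + 1] = os.path.abspath(os.path.join(agent_dir, args[i + 1]))
    return args
```

Using `shlex.split` rather than `str.split` keeps quoted option values intact if `run.configs` ever contains spaces inside quotes.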

### 4) Configure tasks in AgentKernelArena

Edit `AgentKernelArena/config.yaml`:

1) Select this agent template:

```yaml
agent:
  template: geak_v3
```

2) Select tasks to run (task names are relative to `tasks/`):

Example tasks for HIP kernels:
```yaml
tasks:
- hip2hip/others/
- repository/rocprim/block_radix_rank
- repository/rocprim/device_binary_search
- repository/rocprim/device_search_n
- repository/rocprim/device_merge_sort
```

### 5) Run

From the `AgentKernelArena/` directory:

```bash
python3 main.py
```

### 6) Where to find results

Quick checklist:

- **AgentKernelArena run log**: `logs/*.log` (path controlled by `log_directory` in `AgentKernelArena/config.yaml`)
- **Workspace root**: `workspace_<GPU>_geak_v3/` (you can rename it by changing `workspace_directory_prefix` in `AgentKernelArena/config.yaml`)
- **Per-task results**: `workspace_.../<task>_<timestamp>/task_result.yaml` (also `baseline_perf.yaml`, `optimized_perf.yaml`, `build/performance_report.json`)
- **GEAK logs**: `workspace_.../<task>_<timestamp>_logs/` (see `best_results.json`, `parallel_*/`)
- **Aggregate summary**: `workspace_.../task_results_summary.csv` (and sometimes `task_results_report.txt`)
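
To gather the per-task result files from the checklist above without browsing directories by hand, a small helper can walk the workspace roots. This is a sketch that assumes the default `workspace_` prefix and the `<task>_<timestamp>/task_result.yaml` layout described above; `find_task_results` is a hypothetical name.

```python
from pathlib import Path


def find_task_results(root=".", prefix="workspace_"):
    """Return every task_result.yaml below directories named <prefix>*."""
    results = []
    for workspace in sorted(Path(root).glob(prefix + "*")):
        if workspace.is_dir():
            # rglob tolerates task names that contain subdirectories.
            results.extend(sorted(workspace.rglob("task_result.yaml")))
    return results
```

If you changed `workspace_directory_prefix` in `AgentKernelArena/config.yaml`, pass the new value as `prefix`.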

### Important: tasks run serially

In AgentKernelArena, the `tasks:` list is executed **sequentially (one task at a time)**. To increase overall throughput, give each task more GPUs through **GEAK parallelism inside the task**, using `--num-parallel` and `--gpu-ids`.
4 changes: 4 additions & 0 deletions agents/geak_v3/__init__.py
@@ -0,0 +1,4 @@
# Copyright(C) [2026] Advanced Micro Devices, Inc. All rights reserved.
from agents.geak_v3.launch_agent import launch_agent

__all__ = ["launch_agent"]
8 changes: 8 additions & 0 deletions agents/geak_v3/agent_config.yaml
@@ -0,0 +1,8 @@
version: 0

# Agent timeout settings
timeout_seconds: 36000
python_path: python3

run:
  configs: '-c geak.yaml --yolo --num-parallel=2 --gpu-ids=0,1'
31 changes: 31 additions & 0 deletions agents/geak_v3/geak.yaml
@@ -0,0 +1,31 @@
agent:
  step_limit: 0.
  cost_limit: 0.
  mode: confirm
env:
  env:
    PAGER: cat
    MANPAGER: cat
    LESS: -R
    PIP_PROGRESS_BAR: 'off'
    TQDM_DISABLE: '1'
  timeout: 3600
model:
  model_class: amd_llm
  # claude-opus-4.5, claude-sonnet-4.5, gpt-5.1, gpt-5, gpt-5-codex
  model_name: claude-opus-4.5
  api_key: ""
  # model_kwargs:
  #   temperature: 0.0
  #   max_tokens: 16000
  #   # reasoning is only valid for gpt models, can be set to none, low, medium, high
  #   reasoning:
  #     effort: high
  #   # text is only valid for gpt models, can be set to low or high. determines how many output tokens are generated
  #   text:
  #     verbosity: low

tools:
  profiling: false
  profiling_type: profiling
  strategy_manager: true