AMD Kernel Agent Skill Pack

An engineered Skill + knowledge base that enables existing LLM agents (Claude / Cursor / Qwen, etc.) to perform high-performance kernel optimization on AMD GPUs — no additional model training required.

Core Philosophy

No model training — pure engineering: structured domain knowledge injection + verification loop + knowledge accumulation

After reading SKILL.md, the agent can:

Auto-detect target hardware (MI300X / MI355X) and load architecture-specific knowledge
Route to a specialized sub-skill by programming path (Triton / HIP C++ / CK)
Reference SOTA code patterns, using existing best implementations as baseline and identifying directions to surpass them
Follow a five-step workflow (Analyze -> Implement -> Verify -> Iterate -> Accumulate), with automatic logging each round
Persist through failures with at least 3 retry attempts, consulting the knowledge base for solutions

Project Structure

amd-kernel-skill/
├── SKILL.md                          # Main routing skill (hardware detection -> path dispatch -> constraints)
├── README.md                         # This file
│
├── skills/                           # 3 specialized sub-skills by programming path
│   ├── triton-kernel/SKILL.md        #   Triton ROCm (autotune, matrix_instr_nonkdim, AMD passes)
│   ├── hip-kernel/SKILL.md           #   HIP C++ (MFMA intrinsics, LDS, profiling)
│   └── ck-kernel/SKILL.md            #   Composable Kernel (pipeline selection, tile configuration)
│
├── references/                       # 5-layer knowledge base (35 structured docs)
│   ├── hardware/                     #   Layer 1: Hardware architecture (MI300X/MI325X/MI355X/MI350X)
│   ├── isa/                          #   Layer 2: ISA instruction-level (MFMA, memory, registers, scheduling, inline asm)
│   ├── toolchain/                    #   Layer 3: Toolchain (profiling decision tree, rocprof, hipcc, Triton)
│   ├── libraries/                    #   Layer 4: Libraries & APIs (GEMM tuning, AITER, CK, RCCL)
│   └── optimization/                 #   Layer 5: Optimization patterns (SOTA recipes, advanced techniques, common mistakes)
│
├── scripts/                          # Verification scripts
│   ├── verify_correctness.py         #   Correctness verification (multiple inputs + dtype-graded precision)
│   └── benchmark_kernel.py           #   Performance measurement (warmup + median + structured output)
│
├── templates/                        # Code skeletons
│   ├── triton_kernel_template.py     #   Triton kernel starting point (with AMD autotune config)
│   ├── hip_kernel_template.cpp       #   HIP kernel + PyTorch binding starting point
│   └── benchmark_template.py         #   General benchmark script
│
├── agent_output/                     # Agent iteration records (auto-generated)
│   └── <OP>/<backend>/round-N/       #   Per round: kernel + correctness.log + benchmark.log + summary.md
│
├── rocm-related-pdf/                 # 9 official AMD PDFs (ISA specs, whitepapers, product guides)
├── docs/superpowers/                 # Design documents and implementation plans
│
└── vendor/                           # Reference codebases (git submodules)
    ├── composable_kernel/            #   ROCm/composable_kernel — CK pipeline/tile source
    ├── aiter/                        #   ROCm/aiter — AITER optimized operators
    ├── rocm-examples/                #   amd/rocm-examples — HIP programming patterns
    └── triton/                       #   ROCm/triton — Triton ROCm backend

Knowledge Base Scale

Metric	Value
Structured reference docs	35 documents (~6800 lines)
SOTA kernel recipes	7 production-grade code patterns (FP8 GEMM 2600T, RMSNorm, MoE, ping-pong...)
Profiling decision tree	1 mechanical bottleneck -> action flowchart
Raw crawled data	143+ pages of ROCm docs/blogs
Ingested PDFs	9 (CDNA3/4 ISA + whitepapers + product guides)
Reference codebases	4 submodules (CK, AITER, rocm-examples, Triton)
Data sources	ROCm docs, ROCm blogs, GPUOpen, AMD whitepapers, CK/AITER source analysis

Quick Start

Method 1: Install Script (Recommended)

# Install to a Cursor project
./install.sh cursor /path/to/your/project

# Install to a Claude Code project
./install.sh claude /path/to/your/project

# Install to any agent configuration directory
./install.sh custom /path/to/target/skills/dir

Method 2: Manual Copy

# Cursor
cp -r SKILL.md skills/ references/ scripts/ templates/ /path/to/project/.cursor/skills/amd-kernel/

# Claude Code
cp -r SKILL.md skills/ references/ scripts/ templates/ /path/to/project/.claude/skills/amd-kernel/

Method 3: Symlink (Development Mode, Live Sync)

./install.sh cursor-link /path/to/your/project

Usage After Installation

Simply give the agent a task:

"Optimize triton/attention/attention_kernel.py, target 10% faster than torch.compile, hardware MI300X"

"Implement a fused LayerNorm+SwiGLU kernel in HIP C++ for MI355X"

"Tune CK GEMM tile configuration, find optimal parameters for BF16 GEMM M=4096 N=4096 K=8192"

"Analyze profiling results — the kernel's MFMA utilization is only 30%, help find the bottleneck and optimize"

Supported Target Hardware

GPU	Architecture	offload-arch	Status
MI300X	CDNA3	`gfx942`	Full support
MI325X	CDNA3	`gfx942`	Full support
MI355X	CDNA4	`gfx950`	Full support
MI350X	CDNA4	`gfx950`	Full support

License

Knowledge base content is compiled from AMD public documentation, ROCm open-source projects, and official blogs. Submodules under vendor/ each follow their respective original licenses.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AMD Kernel Agent Skill Pack

Core Philosophy

Project Structure

Knowledge Base Scale

Quick Start

Method 1: Install Script (Recommended)

Method 2: Manual Copy

Method 3: Symlink (Development Mode, Live Sync)

Usage After Installation

Supported Target Hardware

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
agent_output		agent_output
docs/superpowers		docs/superpowers
references		references
rocm-related-pdf		rocm-related-pdf
rules		rules
scripts		scripts
skills		skills
templates		templates
vendor		vendor
.gitignore		.gitignore
.gitmodules		.gitmodules
README.md		README.md
SKILL.md		SKILL.md
install.sh		install.sh

Folders and files

Latest commit

History

Repository files navigation

AMD Kernel Agent Skill Pack

Core Philosophy

Project Structure

Knowledge Base Scale

Quick Start

Method 1: Install Script (Recommended)

Method 2: Manual Copy

Method 3: Symlink (Development Mode, Live Sync)

Usage After Installation

Supported Target Hardware

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages