68 changes: 68 additions & 0 deletions .github/agents/skills/accordo/SKILL.md
@@ -0,0 +1,68 @@
---
name: accordo-validation
description: Validate GPU kernel correctness by comparing reference and optimized outputs. Use when verifying that an optimized or modified kernel matches a reference implementation.
---

# Accordo: GPU Kernel Validation

Capture and compare kernel outputs from reference and optimized binaries to validate correctness. Uses kernelDB for automatic kernel extraction; supports configurable tolerance and execution-time comparison.

## When to Use

- User has a reference and an optimized (or modified) GPU kernel and wants to check they produce the same results
- Regression testing after kernel or build changes
- Validating multiple optimization variants against one baseline

## Instructions

1. **Require two or more binaries:** one reference (e.g. `./app_ref`) and one or more to validate (e.g. `./app_opt`). All must expose the same kernel by name.
2. **Ensure binaries are built with debug symbols** (`-g`) so kernel arguments can be extracted.
3. **Choose execution path:**
- If an Accordo MCP server is available, call its `validate_kernel_correctness` tool, which performs capture-and-compare with the same semantics described below.
- Otherwise use the Python API or the `accordo validate` CLI (`accordo validate --help` for flags: `--kernel-name`, `--ref-binary`, `--opt-binary`, `--tolerance`, `--timeout`, `--working-dir`, `--kernel-args`, `--log-level`).
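As a sketch, a CLI invocation built from the flags listed above might look like this (all paths and the kernel name are placeholders; confirm exact semantics against `accordo validate --help`):

```shell
# Validate an optimized build against the reference build.
# Binary paths and kernel name are illustrative placeholders.
accordo validate \
  --kernel-name reduce_sum \
  --ref-binary ./app_ref \
  --opt-binary ./app_opt \
  --tolerance 1e-6 \
  --timeout 120 \
  --working-dir ./run
```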

### Python API

```python
from accordo import Accordo

# Create a validator; the binary is used to extract the kernel signature
validator = Accordo(binary="./app_ref", kernel_name="reduce_sum")

# Optional: set working directory if binaries expect it
validator = Accordo(binary="./app_ref", kernel_name="reduce_sum", working_directory="./run")

# Capture snapshots
ref = validator.capture_snapshot(binary="./app_ref")
opt = validator.capture_snapshot(binary="./app_opt")

# Compare with tolerance (default 1e-6)
result = validator.compare_snapshots(ref, opt, tolerance=1e-6)

if result.is_valid:
    print("PASS:", result.num_arrays_validated, "arrays matched")
else:
    print(result.summary())
```

For multiple optimizations, capture the reference once and compare each optimized snapshot against it.
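That pattern can be sketched as follows (the variant binary paths are hypothetical; the API calls mirror the example above):

```python
from accordo import Accordo

# Hypothetical optimization variants to check against one baseline.
variants = ["./app_opt_v1", "./app_opt_v2", "./app_opt_v3"]

validator = Accordo(binary="./app_ref", kernel_name="reduce_sum")
ref = validator.capture_snapshot(binary="./app_ref")  # capture the reference once

for binary in variants:
    opt = validator.capture_snapshot(binary=binary)
    result = validator.compare_snapshots(ref, opt, tolerance=1e-6)
    status = "PASS" if result.is_valid else "FAIL"
    print(f"{binary}: {status}")
    if not result.is_valid:
        print(result.summary())
```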

### Snapshot and result attributes

- **Snapshot:** `arrays`, `execution_time_ms`, `grid_size`, `block_size`
- **ValidationResult:** `is_valid`, `num_arrays_validated`, `num_mismatches`, `mismatches`, `success_rate`; use `summary()` for a human-readable report.

## Workflow

1. Build reference and optimized binaries with the same kernel name and `-g`.
2. Create an `Accordo(binary=ref_binary, kernel_name="...")` validator; set `working_directory` if needed.
3. Capture reference snapshot with `capture_snapshot(binary=ref_binary)`.
4. For each variant, capture with `capture_snapshot(binary=opt_binary)` and compare with `compare_snapshots(ref, opt, tolerance=...)`.
5. If `result.is_valid` is false, use `result.summary()` and `result.mismatches` to diagnose.
6. Use relative paths for binaries and working directory so the skill is portable.

## Notes

- kernelDB is used automatically; no separate kernelDB setup is required when using the Python API.
- Increase `tolerance` for floating-point comparisons when appropriate (e.g. 1e-4 or 1e-5 for single precision).
- Use `timeout_seconds` in `capture_snapshot` if the run may hang.
131 changes: 131 additions & 0 deletions .github/agents/skills/kerncap/SKILL.md
@@ -0,0 +1,131 @@
---
name: test-kerncap
description: Test local kerncap changes end-to-end by profiling an application, extracting a kernel, and validating the reproducer. Use when the user asks to test kerncap against any HIP or Triton workload, or wants to validate extraction on a real GPU application.
---

# Test kerncap Against an Application

Test local kerncap changes end-to-end by extracting and validating a kernel from any application.

## Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| `app_cmd` | **Yes** | Full command to run the application (binary + arguments), e.g. `$WORK/dev/llama.cpp/build/bin/llama-bench -m model.gguf -p 512 -n 32` |
| `conda_env` | No | Conda environment to activate before running commands (e.g. `llama_cpp`). If not provided, use the current environment. |
| `kernel_name` | No | Name of the kernel to extract (e.g. `mul_mat_q`). If not provided, profile the application first and select the top kernel by execution time. |

## Paths

| Item | Path |
|------|------|
| kerncap source | `kerncap/` (relative to IntelliKit repo root) |
| Output directory | `/tmp/kerncap-test/<kernel_name>` |

## Environment Setup

If `conda_env` is provided, activate it before any other step:

```bash
conda activate <conda_env>
```

If already in a different environment, switch explicitly. Do not assume the current shell environment is correct.

If `conda_env` is not provided, proceed with the current environment as-is.

## Workflow

### Step 1: Reinstall kerncap

Ensure the correct environment is active (if applicable), then uninstall and reinstall to pick up local changes:

```bash
pip uninstall kerncap -y && pip install kerncap/
```

### Step 2: Profile to identify target kernel

**If `kernel_name` was provided**: Skip this step and proceed to Step 3.

**If `kernel_name` was not provided**: Run profiling to discover the top bottleneck kernel:

```bash
kerncap profile -- <app_cmd>
```

Select the kernel with the highest total execution time from the profile output. Use its name as `kernel_name` for all subsequent steps. Tell the user which kernel was selected and why.

**Important**: Use a sufficiently long substring from the profile output as `kernel_name` so that `kerncap extract` matches the intended kernel, not a different instantiation. For example, templated kernels like `mul_mat_q` have many instantiations differing only by template parameters; passing just `mul_mat_q` will capture the first dispatch that matches, which may not be the top-ranked one. Prefer including template parameters in the substring (e.g. `mul_mat_q<(ggml_type)39` instead of `mul_mat_q`).
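To see why the substring matters, here is a small self-contained illustration (the instantiation names and dispatch order are made up for the example; real mangled names will differ):

```python
# Simulated profile output: templated instantiations of the same kernel,
# ranked by total execution time (names are illustrative).
profile_ranking = [
    "mul_mat_q<(ggml_type)39, 64, 8>",   # top kernel by time
    "mul_mat_q<(ggml_type)2, 128, 4>",
    "mul_mat_q<(ggml_type)8, 64, 2>",
]

# Dispatch order at runtime need not match the profile ranking.
dispatch_order = [
    "mul_mat_q<(ggml_type)2, 128, 4>",
    "mul_mat_q<(ggml_type)8, 64, 2>",
    "mul_mat_q<(ggml_type)39, 64, 8>",
]

def first_match(substring, dispatches):
    """Mimic capturing the first dispatch whose name contains the substring."""
    return next(d for d in dispatches if substring in d)

# A short substring grabs whichever instantiation dispatches first...
print(first_match("mul_mat_q", dispatch_order))
# ...while including template parameters pins the top-ranked one.
print(first_match("mul_mat_q<(ggml_type)39", dispatch_order))
```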

### Step 3: Extract the kernel

```bash
kerncap extract --help
```

Use the help output to construct the appropriate `kerncap extract` command for the application. Key flags to determine:

- `--cmd` — the application command (`app_cmd`)
- `--source-dir` — where the kernel source lives (ask the user if unclear)
- `--output` — `/tmp/kerncap-test/<kernel_name>`
- `--language` — `hip` or `triton` depending on the workload
- Any additional flags (`-D` defines, `--dispatch`, etc.)
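Putting those flags together, a command for a hypothetical HIP workload might look like the sketch below (all paths and the kernel name are placeholders; confirm flag names and kernel selection against `kerncap extract --help`):

```shell
kerncap extract \
  --cmd "$WORK/dev/llama.cpp/build/bin/llama-bench -m model.gguf -p 512 -n 32" \
  --source-dir "$WORK/dev/llama.cpp" \
  --output /tmp/kerncap-test/mul_mat_q \
  --language hip
```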

**If extraction fails or produces errors**: Stop here and report the full error output. This indicates the local kerncap changes have a bug that needs fixing.

**If extraction succeeds**: Inspect the output directory for expected files (`metadata.json`, argument dumps, source files). If the output looks reasonable, proceed to compile and run.

### Step 4: Compile and run the reproducer

Navigate to the output directory and build/run the reproducer:

```bash
cd /tmp/kerncap-test/<kernel_name>
make run
```

**If `make run` fails**: Stop here and report the full compiler or runtime error output. This is the primary signal that kerncap generated an incorrect reproducer.

**If `make run` succeeds**: Proceed to validation.

### Step 5: Validate the reproducer

**5a. Smoke test** — confirm baseline replay works:

```bash
kerncap validate /tmp/kerncap-test/<kernel_name>
```

This is a smoke test only (VA-faithful captures). It confirms the replay runs without crashing but does not check numerical correctness.

**5b. Recompile** — build a baseline HSACO from the unmodified kernel source:

```bash
cd /tmp/kerncap-test/<kernel_name>
make recompile
```

This confirms the VFS-overlay recompile pipeline works. It produces `optimized.hsaco` from the unmodified `kernel_variant.cpp`.

**If `make recompile` fails**: Stop here and report the error. This indicates an issue with the source finder or VFS overlay generation.

**5c. Correctness validation** — compare recompiled HSACO against captured baseline:

```bash
kerncap validate /tmp/kerncap-test/<kernel_name> --hsaco /tmp/kerncap-test/<kernel_name>/optimized.hsaco
```

This runs replay twice (captured HSACO vs recompiled HSACO) and compares outputs byte-for-byte. Since the kernel source is unmodified, they should match exactly. A failure here indicates a recompilation fidelity issue.

### Step 6: Report results

Summarize:
- Whether reinstall succeeded
- Whether profiling identified a kernel (if applicable, and which one)
- Whether extraction completed (and any warnings)
- Whether `make run` compiled and executed successfully
- Whether smoke test passed (Step 5a)
- Whether recompile succeeded (Step 5b)
- Whether correctness validation passed (Step 5c)
- Any errors or warnings encountered at each step
98 changes: 98 additions & 0 deletions .github/agents/skills/linex/SKILL.md
@@ -0,0 +1,98 @@
---
name: linex-profiling
description: Profile GPU kernels at source-line granularity with cycle-level timing and stall analysis. Use when identifying performance hotspots at the source code level or analyzing instruction-level metrics mapped to source lines.
---

# Linex: Source-Level GPU Performance Profiling

Map GPU performance metrics to your source code lines. Get cycle-level timing, stall analysis, and instruction-level metrics for each line of source code.

## When to Use

- User asks to profile a GPU application at source-line granularity
- Need to identify which specific lines of code are performance bottlenecks
- Analyzing stall patterns and execution bottlenecks at the source level
- Understanding cycle-level timing for each line of code
- Instruction-level analysis mapped to source lines

## Instructions

1. **Ensure the target runs on AMD ROCm 7.0+** with `rocprofv3` available.
2. **Kernels must be compiled with `-g`** (debug symbols) for source mapping.
3. **Choose execution path:**
- If a Linex MCP server is available, use its MCP tools:
- `profile_application` to run and profile a target application with the options below.
- `analyze_instruction_hotspots` to perform instruction-level hotspot analysis on collected profiles.
- Otherwise use the Python API from the environment where Linex is installed.

### Python API

```python
from linex import Linex

profiler = Linex(
    target_cu=0,                      # Target compute unit
    shader_engine_mask="0xFFFFFFFF",  # All shader engines
    activity=10,                      # Activity counter polling
)

profiler.profile("./my_app", kernel_filter="my_kernel")

# Show hotspots (sorted by total_cycles)
for line in profiler.source_lines[:5]:
    print(f"{line.file}:{line.line_number}")
    print(f"  {line.total_cycles:,} cycles ({line.stall_percent:.1f}% stalled)")
    print(f"  Executed {line.execution_count} times")

# Find memory-bound lines
memory_bound = [
    l for l in profiler.source_lines
    if l.stall_percent > 50
]

# Instruction-level analysis
for line in profiler.source_lines[:1]:
    for inst in line.instructions:
        print(f"{inst.isa}: {inst.latency_cycles} cycles")
```

### SourceLine Properties

- `file` - Source file path
- `line_number` - Line number
- `total_cycles` - Sum of all instruction cycles
- `stall_cycles` - Cycles spent waiting
- `idle_cycles` - Cycles slot was idle
- `execution_count` - Total executions
- `instructions` - List of ISA instructions
- `stall_percent` - Convenience: `stall_cycles / total_cycles * 100`

### InstructionData Properties

- `isa` - ISA instruction text
- `latency_cycles` - Total cycles for this instruction
- `stall_cycles` - Cycles spent waiting
- `idle_cycles` - Cycles slot was idle
- `execution_count` - How many times it ran
- `instruction_address` - Virtual address in GPU memory
- `file` - Parsed from source_location
- `line` - Parsed from source_location
- `stall_percent` - Convenience: `stall_cycles / latency_cycles * 100`
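As an illustration of how these fields compose, the sketch below aggregates stall behaviour per file using stand-in objects that mirror the documented `SourceLine` fields (the numbers are made up; in real use the lines come from `profiler.source_lines`):

```python
from dataclasses import dataclass

@dataclass
class SourceLine:
    """Stand-in carrying the documented fields used below."""
    file: str
    line_number: int
    total_cycles: int
    stall_cycles: int

    @property
    def stall_percent(self) -> float:
        return self.stall_cycles / self.total_cycles * 100

# In real use these would come from profiler.source_lines.
source_lines = [
    SourceLine("kernel.hip", 42, 900_000, 600_000),
    SourceLine("kernel.hip", 57, 300_000, 30_000),
    SourceLine("util.hip",   12, 100_000, 80_000),
]

# Aggregate [total_cycles, stall_cycles] per file.
per_file: dict[str, list[int]] = {}
for line in source_lines:
    totals = per_file.setdefault(line.file, [0, 0])
    totals[0] += line.total_cycles
    totals[1] += line.stall_cycles

for path, (total, stalled) in per_file.items():
    print(f"{path}: {stalled / total * 100:.1f}% stalled")
```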

## Workflow

1. Ensure the target binary is built with `-g` (debug symbols) for source mapping.
2. Create a `Linex()` profiler; optionally set `target_cu`, `shader_engine_mask`, or `activity`.
3. Call `profiler.profile(command, kernel_filter=...)` to run profiling.
4. Access `profiler.source_lines` (sorted by total_cycles) to find hotspots.
5. Use `line.stall_percent` to identify memory-bound or dependency-bound lines.
6. Drill down into `line.instructions` for instruction-level analysis.
7. Use relative paths for the target binary so the skill is portable.

## Notes

- Requires ROCm 7.0+ with `rocprofv3` support.
- Source mapping requires kernels compiled with `-g` (debug symbols).
- `source_lines` are automatically sorted by `total_cycles` (descending).
- Use `kernel_filter` to profile specific kernels by name (regex pattern).
- For Triton or other frameworks, ensure debug symbols are available in the compiled output.
76 changes: 76 additions & 0 deletions .github/agents/skills/metrix/SKILL.md
@@ -0,0 +1,76 @@
---
name: metrix-profiling
description: Profile GPU kernels when performance analysis or optimization is required. Use for AMD ROCm GPU metrics, bandwidth, cache hit rates, coalescing, or kernel timing.
---

# Metrix: GPU Profiling

Profile AMD GPU kernels and get human-readable metrics (bandwidth, cache, coalescing, FLOPS). Architecture is auto-detected.

## When to Use

- User asks to profile a GPU application or kernel
- Performance analysis, optimization, or bottleneck investigation
- Need HBM/L2/L1 bandwidth, hit rates, or compute metrics
- Need timing-only runs (fast, no hardware counters)

## Instructions

1. **Ensure the target runs on AMD ROCm** (e.g. `hipcc`-built binary or Python script that launches HIP/ROCm kernels).
2. **Choose execution path:**
- If a Metrix MCP server is available, use its profile tool with the same options below.
- Otherwise run the CLI or Python API from the environment where Metrix is installed.

### CLI

From the project or install prefix:

```bash
# Profile with all metrics (auto-detected arch)
metrix ./my_app

# Time only (fast, no counters)
metrix --time-only -n 10 ./my_app

# Filter kernels by name
metrix --kernel matmul ./my_app

# Specific metrics
metrix --metrics memory.l2_hit_rate,memory.coalescing_efficiency,compute.total_flops ./my_app

# Save to JSON/CSV
metrix -o results.json ./my_app
```

Options:

- `--profile`/`-p` — profile name (run `metrix list profiles` for names: `quick`, `memory`, `memory_bandwidth`, `memory_cache`, `compute`)
- `--metrics`/`-m`, `--time-only`, `--kernel`/`-k` (regular expression), `--num-replays`/`-n`, `--output`/`-o`
- `--top`, `--aggregate`, `--timeout`, `--no-counters`, `--log`/`-l`, `--quiet`/`-q`

Discovery: `metrix list <metrics|profiles|devices>`, `metrix info <metric|profile> <name>`. Note: `metrix list counters` and `metrix info counter <name>` are not implemented yet (the CLI reports "not yet implemented").

### Python API

```python
from metrix import Metrix

profiler = Metrix()
results = profiler.profile("./my_app", num_replays=5)

for kernel in results.kernels:
    print(kernel.name, kernel.duration_us.avg)
    for metric, stats in kernel.metrics.items():
        print(f"  {metric}: {stats.avg}")
```

Use `metrics=[...]` for a subset; omit for all metrics. Use `cwd` when the binary expects a specific working directory.

## Workflow

1. Identify the executable or script to profile (e.g. `./app` or `python run_kernels.py`).
2. If only timing is needed, use `--time-only` for speed.
3. If full metrics are needed, run `metrix ./app` (or MCP equivalent); optionally restrict with `--kernel` or `--metrics`.
4. Interpret results: low L2 hit rate, low coalescing, or low HBM utilization suggest optimization targets.
5. For automation or tooling, use `-o results.json` and parse the JSON output.
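Step 4's interpretation can be mechanized; the sketch below flags likely optimization targets from a dict of per-kernel metric averages (the threshold values and sample numbers are illustrative assumptions, not Metrix defaults):

```python
# Illustrative thresholds (percentages) -- tune for your workload and GPU.
THRESHOLDS = {
    "memory.l2_hit_rate": 50.0,
    "memory.coalescing_efficiency": 60.0,
    "memory.hbm_bandwidth_utilization": 30.0,
}

def flag_targets(metrics: dict[str, float]) -> list[str]:
    """Return metric names whose average falls below its threshold."""
    return [
        name for name, floor in THRESHOLDS.items()
        if metrics.get(name, float("inf")) < floor
    ]

# Hypothetical averages for one kernel: low L2 hit rate and low HBM
# utilization suggest memory-access optimization targets.
sample = {
    "memory.l2_hit_rate": 34.2,
    "memory.coalescing_efficiency": 88.0,
    "memory.hbm_bandwidth_utilization": 21.5,
}
print(flag_targets(sample))
```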

## Key Metrics (reference)

- **Memory:** `memory.hbm_bandwidth_utilization`, `memory.l2_hit_rate`, `memory.l1_hit_rate`, `memory.coalescing_efficiency`, `memory.global_load_efficiency`, `memory.lds_bank_conflicts`, `memory.atomic_latency`
- **Compute:** `compute.total_flops`, `compute.hbm_gflops`, `compute.hbm_arithmetic_intensity`, `compute.l2_arithmetic_intensity`, `compute.l1_arithmetic_intensity`

Use relative paths for the target binary and output files so the skill is portable across environments.