68 changes: 68 additions & 0 deletions .github/agents/skills/accordo/SKILL.md
@@ -0,0 +1,68 @@
---
name: accordo-validation
description: Validate GPU kernel correctness by comparing reference and optimized outputs. Use when verifying that an optimized or modified kernel matches a reference implementation.
---

# Accordo: GPU Kernel Validation

Capture and compare kernel outputs from reference and optimized binaries to validate correctness. Uses kernelDB for automatic kernel extraction; supports configurable tolerance and execution-time comparison.

## When to Use

- User has a reference and an optimized (or modified) GPU kernel and wants to check they produce the same results
- Regression testing after kernel or build changes
- Validating multiple optimization variants against one baseline

## Instructions

1. **Require two or more binaries:** one reference (e.g. `./app_ref`) and one or more to validate (e.g. `./app_opt`). All must expose the same kernel by name.
2. **Ensure binaries are built with debug symbols** (`-g`) so kernel arguments can be extracted.
3. **Choose execution path:**
- If an Accordo MCP server is available, call its `validate_kernel_correctness` tool, which performs capture-and-compare with the same semantics described below.
- Otherwise use the Python API or the `accordo validate` CLI (`accordo validate --help` for flags: `--kernel-name`, `--ref-binary`, `--opt-binary`, `--tolerance`, `--timeout`, `--working-dir`, `--kernel-args`, `--log-level`).
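As a sketch, a CLI invocation built from the flags listed above might look like this (all paths and the kernel name are placeholders; confirm exact semantics against `accordo validate --help`):

```shell
# Validate an optimized build against the reference build.
# Binary paths and kernel name are illustrative placeholders.
accordo validate \
  --kernel-name reduce_sum \
  --ref-binary ./app_ref \
  --opt-binary ./app_opt \
  --tolerance 1e-6 \
  --timeout 120 \
  --working-dir ./run
```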

### Python API

```python
from accordo import Accordo

# Create a validator; the binary is used to extract the kernel signature
validator = Accordo(binary="./app_ref", kernel_name="reduce_sum")

# Optional: set working directory if binaries expect it
validator = Accordo(binary="./app_ref", kernel_name="reduce_sum", working_directory="./run")

# Capture snapshots
ref = validator.capture_snapshot(binary="./app_ref")
opt = validator.capture_snapshot(binary="./app_opt")

# Compare with tolerance (default 1e-6)
result = validator.compare_snapshots(ref, opt, tolerance=1e-6)

if result.is_valid:
    print("PASS:", result.num_arrays_validated, "arrays matched")
else:
    print(result.summary())
```

For multiple optimizations, capture the reference once and compare each optimized snapshot against it.
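That pattern can be sketched as follows (the variant binary paths are hypothetical; the API calls mirror the example above):

```python
from accordo import Accordo

# Hypothetical optimization variants to check against one baseline.
variants = ["./app_opt_v1", "./app_opt_v2", "./app_opt_v3"]

validator = Accordo(binary="./app_ref", kernel_name="reduce_sum")
ref = validator.capture_snapshot(binary="./app_ref")  # capture the reference once

for binary in variants:
    opt = validator.capture_snapshot(binary=binary)
    result = validator.compare_snapshots(ref, opt, tolerance=1e-6)
    status = "PASS" if result.is_valid else "FAIL"
    print(f"{binary}: {status}")
    if not result.is_valid:
        print(result.summary())
```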

### Snapshot and result attributes

- **Snapshot:** `arrays`, `execution_time_ms`, `grid_size`, `block_size`
- **ValidationResult:** `is_valid`, `num_arrays_validated`, `num_mismatches`, `mismatches`, `success_rate`; use `summary()` for a human-readable report.

## Workflow

1. Build reference and optimized binaries with the same kernel name and `-g`.
2. Create an `Accordo(binary=ref_binary, kernel_name="...")` validator; set `working_directory` if needed.
3. Capture reference snapshot with `capture_snapshot(binary=ref_binary)`.
4. For each variant, capture with `capture_snapshot(binary=opt_binary)` and compare with `compare_snapshots(ref, opt, tolerance=...)`.
5. If `result.is_valid` is false, use `result.summary()` and `result.mismatches` to diagnose.
6. Use relative paths for binaries and working directory so the skill is portable.

## Notes

- kernelDB is used automatically; no separate kernelDB setup is required when using the Python API.
- Increase `tolerance` for floating-point comparisons when appropriate (e.g. 1e-4 or 1e-5 for single precision).
- Use `timeout_seconds` in `capture_snapshot` if the run may hang.
131 changes: 131 additions & 0 deletions .github/agents/skills/kerncap/SKILL.md
@@ -0,0 +1,131 @@
---
name: test-kerncap
description: Test local kerncap changes end-to-end by profiling an application, extracting a kernel, and validating the reproducer. Use when the user asks to test kerncap against any HIP or Triton workload, or wants to validate extraction on a real GPU application.
---

# Test kerncap Against an Application

Test local kerncap changes end-to-end by extracting and validating a kernel from any application.

## Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| `app_cmd` | **Yes** | Full command to run the application (binary + arguments), e.g. `$WORK/dev/llama.cpp/build/bin/llama-bench -m model.gguf -p 512 -n 32` |
| `conda_env` | No | Conda environment to activate before running commands (e.g. `llama_cpp`). If not provided, use the current environment. |
| `kernel_name` | No | Name of the kernel to extract (e.g. `mul_mat_q`). If not provided, profile the application first and select the top kernel by execution time. |

## Paths

| Item | Path |
|------|------|
| kerncap source | `kerncap/` (relative to IntelliKit repo root) |
| Output directory | `/tmp/kerncap-test/<kernel_name>` |

## Environment Setup

If `conda_env` is provided, activate it before any other step:

```bash
conda activate <conda_env>
```

If already in a different environment, switch explicitly. Do not assume the current shell environment is correct.

If `conda_env` is not provided, proceed with the current environment as-is.

## Workflow

### Step 1: Reinstall kerncap

Ensure the correct environment is active (if applicable), then uninstall and reinstall to pick up local changes:

```bash
pip uninstall kerncap -y && pip install kerncap/
```

### Step 2: Profile to identify target kernel

**If `kernel_name` was provided**: Skip this step and proceed to Step 3.

**If `kernel_name` was not provided**: Run profiling to discover the top bottleneck kernel:

```bash
kerncap profile -- <app_cmd>
```

Select the kernel with the highest total execution time from the profile output. Use its name as `kernel_name` for all subsequent steps. Tell the user which kernel was selected and why.

**Important**: Use a sufficiently long substring from the profile output as `kernel_name` so that `kerncap extract` matches the intended kernel, not a different instantiation. For example, templated kernels like `mul_mat_q` have many instantiations differing only by template parameters; passing just `mul_mat_q` will capture the first dispatch that matches, which may not be the top-ranked one. Prefer including template parameters in the substring (e.g. `mul_mat_q<(ggml_type)39` instead of `mul_mat_q`).
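To see why the substring matters, here is a small self-contained illustration (the instantiation names and dispatch order are made up for the example; real mangled names will differ):

```python
# Simulated profile output: templated instantiations of the same kernel,
# ranked by total execution time (names are illustrative).
profile_ranking = [
    "mul_mat_q<(ggml_type)39, 64, 8>",   # top kernel by time
    "mul_mat_q<(ggml_type)2, 128, 4>",
    "mul_mat_q<(ggml_type)8, 64, 2>",
]

# Dispatch order at runtime need not match the profile ranking.
dispatch_order = [
    "mul_mat_q<(ggml_type)2, 128, 4>",
    "mul_mat_q<(ggml_type)8, 64, 2>",
    "mul_mat_q<(ggml_type)39, 64, 8>",
]

def first_match(substring, dispatches):
    """Mimic capturing the first dispatch whose name contains the substring."""
    return next(d for d in dispatches if substring in d)

# A short substring grabs whichever instantiation dispatches first...
print(first_match("mul_mat_q", dispatch_order))
# ...while including template parameters pins the top-ranked one.
print(first_match("mul_mat_q<(ggml_type)39", dispatch_order))
```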

### Step 3: Extract the kernel

```bash
kerncap extract --help
```

Use the help output to construct the appropriate `kerncap extract` command for the application. Key flags to determine:

- `--cmd` — the application command (`app_cmd`)
- `--source-dir` — where the kernel source lives (ask the user if unclear)
- `--output` — `/tmp/kerncap-test/<kernel_name>`
- `--language` — `hip` or `triton` depending on the workload
- Any additional flags (`-D` defines, `--dispatch`, etc.)
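Putting those flags together, a command for a hypothetical HIP workload might look like the sketch below (all paths and the kernel name are placeholders; confirm flag names and kernel selection against `kerncap extract --help`):

```shell
kerncap extract \
  --cmd "$WORK/dev/llama.cpp/build/bin/llama-bench -m model.gguf -p 512 -n 32" \
  --source-dir "$WORK/dev/llama.cpp" \
  --output /tmp/kerncap-test/mul_mat_q \
  --language hip
```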

**If extraction fails or produces errors**: Stop here and report the full error output. This indicates the local kerncap changes have a bug that needs fixing.

**If extraction succeeds**: Inspect the output directory for expected files (`metadata.json`, argument dumps, source files). If the output looks reasonable, proceed to compile and run.

### Step 4: Compile and run the reproducer

Navigate to the output directory and build/run the reproducer:

```bash
cd /tmp/kerncap-test/<kernel_name>
make run
```

**If `make run` fails**: Stop here and report the full compiler or runtime error output. This is the primary signal that kerncap generated an incorrect reproducer.

**If `make run` succeeds**: Proceed to validation.

### Step 5: Validate the reproducer

**5a. Smoke test** — confirm baseline replay works:

```bash
kerncap validate /tmp/kerncap-test/<kernel_name>
```

This is a smoke test only (VA-faithful captures). It confirms the replay runs without crashing but does not check numerical correctness.

**5b. Recompile** — build a baseline HSACO from the unmodified kernel source:

```bash
cd /tmp/kerncap-test/<kernel_name>
make recompile
```

This confirms the VFS-overlay recompile pipeline works. It produces `optimized.hsaco` from the unmodified `kernel_variant.cpp`.

**If `make recompile` fails**: Stop here and report the error. This indicates an issue with the source finder or VFS overlay generation.

**5c. Correctness validation** — compare recompiled HSACO against captured baseline:

```bash
kerncap validate /tmp/kerncap-test/<kernel_name> --hsaco /tmp/kerncap-test/<kernel_name>/optimized.hsaco
```

This runs replay twice (captured HSACO vs recompiled HSACO) and compares outputs byte-for-byte. Since the kernel source is unmodified, they should match exactly. A failure here indicates a recompilation fidelity issue.

### Step 6: Report results

Summarize:
- Whether reinstall succeeded
- Whether profiling identified a kernel (if applicable, and which one)
- Whether extraction completed (and any warnings)
- Whether `make run` compiled and executed successfully
- Whether smoke test passed (Step 5a)
- Whether recompile succeeded (Step 5b)
- Whether correctness validation passed (Step 5c)
- Any errors or warnings encountered at each step
98 changes: 98 additions & 0 deletions .github/agents/skills/linex/SKILL.md
@@ -0,0 +1,98 @@
---
name: linex-profiling
description: Profile GPU kernels at source-line granularity with cycle-level timing and stall analysis. Use when identifying performance hotspots at the source code level or analyzing instruction-level metrics mapped to source lines.
---

# Linex: Source-Level GPU Performance Profiling

Map GPU performance metrics to your source code lines. Get cycle-level timing, stall analysis, and instruction-level metrics for each line of source code.

## When to Use

- User asks to profile a GPU application at source-line granularity
- Need to identify which specific lines of code are performance bottlenecks
- Analyzing stall patterns and execution bottlenecks at the source level
- Understanding cycle-level timing for each line of code
- Instruction-level analysis mapped to source lines

## Instructions

1. **Ensure the target runs on AMD ROCm 7.0+** with `rocprofv3` available.
2. **Kernels must be compiled with `-g`** (debug symbols) for source mapping.
3. **Choose execution path:**
- If a Linex MCP server is available, use its MCP tools:
- `profile_application` to run and profile a target application with the options below.
- `analyze_instruction_hotspots` to perform instruction-level hotspot analysis on collected profiles.
- Otherwise use the Python API from the environment where Linex is installed.

### Python API

```python
from linex import Linex

profiler = Linex(
    target_cu=0,                      # Target compute unit
    shader_engine_mask="0xFFFFFFFF",  # All shader engines
    activity=10,                      # Activity counter polling
)

profiler.profile("./my_app", kernel_filter="my_kernel")

# Show hotspots (sorted by total_cycles)
for line in profiler.source_lines[:5]:
    print(f"{line.file}:{line.line_number}")
    print(f"  {line.total_cycles:,} cycles ({line.stall_percent:.1f}% stalled)")
    print(f"  Executed {line.execution_count} times")

# Find memory-bound lines
memory_bound = [
    l for l in profiler.source_lines
    if l.stall_percent > 50
]

# Instruction-level analysis
for line in profiler.source_lines[:1]:
    for inst in line.instructions:
        print(f"{inst.isa}: {inst.latency_cycles} cycles")
```

### SourceLine Properties

- `file` - Source file path
- `line_number` - Line number
- `total_cycles` - Sum of all instruction cycles
- `stall_cycles` - Cycles spent waiting
- `idle_cycles` - Cycles slot was idle
- `execution_count` - Total executions
- `instructions` - List of ISA instructions
- `stall_percent` - Convenience: `stall_cycles / total_cycles * 100`

### InstructionData Properties

- `isa` - ISA instruction text
- `latency_cycles` - Total cycles for this instruction
- `stall_cycles` - Cycles spent waiting
- `idle_cycles` - Cycles slot was idle
- `execution_count` - How many times it ran
- `instruction_address` - Virtual address in GPU memory
- `file` - Parsed from source_location
- `line` - Parsed from source_location
- `stall_percent` - Convenience: `stall_cycles / latency_cycles * 100`
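As an illustration of how these fields compose, the sketch below aggregates stall behaviour per file using stand-in objects that mirror the documented `SourceLine` fields (the numbers are made up; in real use the lines come from `profiler.source_lines`):

```python
from dataclasses import dataclass

@dataclass
class SourceLine:
    """Stand-in carrying the documented fields used below."""
    file: str
    line_number: int
    total_cycles: int
    stall_cycles: int

    @property
    def stall_percent(self) -> float:
        return self.stall_cycles / self.total_cycles * 100

# In real use these would come from profiler.source_lines.
source_lines = [
    SourceLine("kernel.hip", 42, 900_000, 600_000),
    SourceLine("kernel.hip", 57, 300_000, 30_000),
    SourceLine("util.hip",   12, 100_000, 80_000),
]

# Aggregate [total_cycles, stall_cycles] per file.
per_file: dict[str, list[int]] = {}
for line in source_lines:
    totals = per_file.setdefault(line.file, [0, 0])
    totals[0] += line.total_cycles
    totals[1] += line.stall_cycles

for path, (total, stalled) in per_file.items():
    print(f"{path}: {stalled / total * 100:.1f}% stalled")
```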

## Workflow

1. Ensure the target binary is built with `-g` (debug symbols) for source mapping.
2. Create a `Linex()` profiler; optionally set `target_cu`, `shader_engine_mask`, or `activity`.
3. Call `profiler.profile(command, kernel_filter=...)` to run profiling.
4. Access `profiler.source_lines` (sorted by total_cycles) to find hotspots.
5. Use `line.stall_percent` to identify memory-bound or dependency-bound lines.
6. Drill down into `line.instructions` for instruction-level analysis.
7. Use relative paths for the target binary so the skill is portable.

## Notes

- Requires ROCm 7.0+ with `rocprofv3` support.
- Source mapping requires kernels compiled with `-g` (debug symbols).
- `source_lines` are automatically sorted by `total_cycles` (descending).
- Use `kernel_filter` to profile specific kernels by name (regex pattern).
- For Triton or other frameworks, ensure debug symbols are available in the compiled output.
76 changes: 76 additions & 0 deletions .github/agents/skills/metrix/SKILL.md
@@ -0,0 +1,76 @@
---
name: metrix-profiling
description: Profile GPU kernels when performance analysis or optimization is required. Use for AMD ROCm GPU metrics, bandwidth, cache hit rates, coalescing, or kernel timing.
---

# Metrix: GPU Profiling

Profile AMD GPU kernels and get human-readable metrics (bandwidth, cache, coalescing, FLOPS). Architecture is auto-detected.

## When to Use

- User asks to profile a GPU application or kernel
- Performance analysis, optimization, or bottleneck investigation
- Need HBM/L2/L1 bandwidth, hit rates, or compute metrics
- Need timing-only runs (fast, no hardware counters)

## Instructions

1. **Ensure the target runs on AMD ROCm** (e.g. `hipcc`-built binary or Python script that launches HIP/ROCm kernels).
2. **Choose execution path:**
- If a Metrix MCP server is available, use its profile tool with the same options below.
- Otherwise run the CLI or Python API from the environment where Metrix is installed.

### CLI

From the project or install prefix:

```bash
# Profile with all metrics (auto-detected arch)
metrix ./my_app

# Time only (fast, no counters)
metrix --time-only -n 10 ./my_app

# Filter kernels by name
metrix --kernel matmul ./my_app

# Specific metrics
metrix --metrics memory.l2_hit_rate,memory.coalescing_efficiency,compute.total_flops ./my_app

# Save to JSON/CSV
metrix -o results.json ./my_app
```

Options:

- `--profile`/`-p` — profile name (run `metrix list profiles` for names: `quick`, `memory`, `memory_bandwidth`, `memory_cache`, `compute`)
- `--metrics`/`-m`, `--time-only`, `--kernel`/`-k` (regular expression), `--num-replays`/`-n`, `--output`/`-o`
- `--top`, `--aggregate`, `--timeout`, `--no-counters`, `--log`/`-l`, `--quiet`/`-q`

Discovery: `metrix list <metrics|profiles|devices>`, `metrix info <metric|profile> <name>`. Note: `metrix list counters` and `metrix info counter <name>` are not implemented yet (the CLI reports "not yet implemented").

### Python API

```python
from metrix import Metrix

profiler = Metrix()
results = profiler.profile("./my_app", num_replays=5)

for kernel in results.kernels:
    print(kernel.name, kernel.duration_us.avg)
    for metric, stats in kernel.metrics.items():
        print(f"  {metric}: {stats.avg}")
```

Use `metrics=[...]` for a subset; omit for all metrics. Use `cwd` when the binary expects a specific working directory.

## Workflow

1. Identify the executable or script to profile (e.g. `./app` or `python run_kernels.py`).
2. If only timing is needed, use `--time-only` for speed.
3. If full metrics are needed, run `metrix ./app` (or MCP equivalent); optionally restrict with `--kernel` or `--metrics`.
4. Interpret results: low L2 hit rate, low coalescing, or low HBM utilization suggest optimization targets.
5. For automation or tooling, use `-o results.json` and parse the JSON output.
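Step 4's interpretation can be mechanized; the sketch below flags likely optimization targets from a dict of per-kernel metric averages (the threshold values and sample numbers are illustrative assumptions, not Metrix defaults):

```python
# Illustrative thresholds (percentages) -- tune for your workload and GPU.
THRESHOLDS = {
    "memory.l2_hit_rate": 50.0,
    "memory.coalescing_efficiency": 60.0,
    "memory.hbm_bandwidth_utilization": 30.0,
}

def flag_targets(metrics: dict[str, float]) -> list[str]:
    """Return metric names whose average falls below its threshold."""
    return [
        name for name, floor in THRESHOLDS.items()
        if metrics.get(name, float("inf")) < floor
    ]

# Hypothetical averages for one kernel: low L2 hit rate and low HBM
# utilization suggest memory-access optimization targets.
sample = {
    "memory.l2_hit_rate": 34.2,
    "memory.coalescing_efficiency": 88.0,
    "memory.hbm_bandwidth_utilization": 21.5,
}
print(flag_targets(sample))
```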

## Key Metrics (reference)

- **Memory:** `memory.hbm_bandwidth_utilization`, `memory.l2_hit_rate`, `memory.l1_hit_rate`, `memory.coalescing_efficiency`, `memory.global_load_efficiency`, `memory.lds_bank_conflicts`, `memory.atomic_latency`
- **Compute:** `compute.total_flops`, `compute.hbm_gflops`, `compute.hbm_arithmetic_intensity`, `compute.l2_arithmetic_intensity`, `compute.l1_arithmetic_intensity`

Use relative paths for the target binary and output files so the skill is portable across environments.