
This is SuperKitty, and she loves to MEOW (Maximizing Efficient Operations per Watt). Inspired by ThunderKittens.
Consumer hardware, especially Apple silicon, is immensely power-dense and underutilized by today's open-source inference and training libraries. SK is here to change that.
Writing deep learning Metal kernels should be easy, and this library aims to make it so without sacrificing performance for abstraction. The goal is the fastest measured (not theoretical) compute, so you can squeeze out maximum performance. It is made up of .metal files, headers, and .c files, and was developed with ease of use in mind. It ships an assortment of Metal kernels so your chips don't starve!
Bridging the gap between bleeding-edge intelligence and consumer hardware!
It is:
- Simple
SuperKittens is straightforward to write and is designed to work out of the box with your existing Apple silicon code across the M-series chips (M1 through M5).
- Fast
We never aimed to sacrifice performance for easier abstractions, and we didn't. On the contrary, we aim to provide kernels that are both simpler and faster.
- We currently support only M1 and M2, and are in the process of adding support for later chips (M2+).
Wheels are published as GitHub Release assets on the private repo. Auth via gh (preferred) or a fine-grained GH_TOKEN with repo:read scope.
# preferred — uses your gh auth, no token plumbing
gh release download dev-latest -p '*.whl' -R Lazarus-931/SuperKittens && pip install superkittens-*.whl
# or with a token
pip install "https://${GH_TOKEN}@github.com/Lazarus-931/SuperKittens/releases/download/dev-latest/superkittens-<version>-cp312-cp312-macosx_<ver>_arm64.whl"

Pinned versions live under tags (v0.1.0, ...) once cut; dev-latest floats on main.
git clone https://github.com/Lazarus-931/SuperKittens.git
cd SuperKittens
./build.sh # compiles Metal kernels → build/libsk.metallib + libsk.dylib

import numpy as np
from sk.src.py import activation
x = np.random.randn(512, 1024).astype(np.float16)
y = activation.gelu(x) # dispatches Metal kernel via ctypes → libsk.dylib

You need the Metal toolchain (metal, metallib). Command Line Tools alone are not enough: they ship metal but not metallib. Two options:
- Full Xcode (App Store). Install Xcode, then:
  sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
  sudo xcodebuild -license accept
- Metal toolchain only (smaller, requires Xcode already installed to invoke xcodebuild):
  xcodebuild -downloadComponent MetalToolchain
Verify:
xcrun -f metallib # should print a path
xcrun -f metal

Python (3.10+, Homebrew recommended):
python3 -m venv ~/sk-venv
source ~/sk-venv/bin/activate
pip install -U "huggingface_hub[cli]" numpy sentencepiece tokenizers

Kernels and their respective benchmarks:
[INSERT TABLE HERE, ROWS ARE KERNELS, CHIPS ARE COLS]
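Until measured numbers are filled in, a small harness along these lines could be used to sanity-check a kernel's output and time it. This is only a sketch: it assumes the tanh approximation of GELU (the README does not say which variant `activation.gelu` implements), and `gelu_reference`, `max_abs_error`, and `time_call` are hypothetical helper names that are not part of SuperKittens. It runs with NumPy alone; a stand-in array takes the place of the kernel result.

```python
import math
import time

import numpy as np


def gelu_reference(x):
    """Tanh-approximation GELU computed in float32 (assumed variant)."""
    x32 = np.asarray(x, dtype=np.float32)
    inner = math.sqrt(2.0 / math.pi) * (x32 + 0.044715 * x32**3)
    return 0.5 * x32 * (1.0 + np.tanh(inner))


def max_abs_error(candidate, reference):
    """Largest elementwise difference, with both sides cast to float32."""
    a = np.asarray(candidate, dtype=np.float32)
    b = np.asarray(reference, dtype=np.float32)
    return float(np.max(np.abs(a - b)))


def time_call(fn, *args, repeats=10):
    """Median wall-clock seconds over `repeats` calls of fn(*args)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return float(np.median(samples))


x = np.random.randn(512, 1024).astype(np.float16)
ref = gelu_reference(x)

# Stand-in for a kernel result: the reference rounded to float16.
# With SuperKittens installed, this would be activation.gelu(x) instead.
candidate = ref.astype(np.float16)

print("max abs error:", max_abs_error(candidate, ref))
print("median seconds:", time_call(gelu_reference, x))
```

Comparing in float32 keeps the check from being dominated by float16 rounding in the reference itself, and the median is less sensitive to one-off stalls than the mean.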
The whole point of SuperKittens is giving you fast, composable Metal primitives you can drop into any project — a Swift app, a C++ inference engine, whatever. No framework lock-in, just headers and shaders.
Here's where we're headed:
- Templated attention — support any head dim (64, 96, 128, 256) and sequence length out of the box, not just hardcoded configs
- Causal masking — fused into the attention kernel, not bolted on after
- Multi-head and GQA — batched heads with grouped-query attention so you can run real models
- GEMM for common inference shapes — not trying to be a general BLAS, just the shapes that actually show up in transformer inference
- One include, everything works — #include "superkittens.h" gives you BlockMMA, Tile, Frag, loaders, and every fused kernel. Compose them into your own stuff or use the ready-made ones
- Docs that actually help — examples showing how to build a custom kernel from the primitives, not just API reference
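
To make the attention items on the roadmap concrete, here is a minimal NumPy sketch of what a causal, grouped-query attention kernel computes. It is reference math only, not the SuperKittens kernel or its API: the function name `gqa_causal_attention`, the tensor layout, and the shapes in the usage lines are assumptions chosen for illustration.

```python
import numpy as np


def gqa_causal_attention(q, k, v):
    """Reference causal attention with grouped-query heads.

    q: (n_q_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim).
    n_q_heads must be a multiple of n_kv_heads.
    """
    n_q, seq, dim = q.shape
    n_kv = k.shape[0]
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group = n_q // n_kv

    # Each KV head is shared by `group` consecutive query heads.
    k_full = np.repeat(k, group, axis=0)
    v_full = np.repeat(v, group, axis=0)

    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(dim)

    # Causal mask: position i may only attend to positions j <= i.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_full


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 64)).astype(np.float32)  # 8 query heads
k = rng.standard_normal((2, 16, 64)).astype(np.float32)  # 2 shared KV heads
v = rng.standard_normal((2, 16, 64)).astype(np.float32)
out = gqa_causal_attention(q, k, v)
print(out.shape)  # (8, 16, 64)
```

A fused kernel would apply the mask while computing the scores rather than materializing the full score matrix as this sketch does. The sketch is useful mainly as a correctness oracle to compare a kernel's output against.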