Releases: KempnerInstitute/KempnerForge
KempnerForge v0.1.0 - Initial Release
PyTorch-native codebase for training foundation models on AI clusters. Fault-tolerant, resilient, distributed, and agent-ready. Built for NeuroAI and general AI research, from a single GPU to multi-node SLURM jobs, with maximum GPU utilization as a first-class goal.
Highlights
- Decoder-only Transformer + MoE: GQA attention with RoPE, SwiGLU MLP, RMSNorm. MoE variants with softmax and sigmoid top-k routing (DeepSeek-V3 style aux-loss-free balancing), sequence-level auxiliary loss, per-expert gradient scaling, and adaptive bias schedule.
- 4 optimizers, 6 LR schedulers: AdamW (fused), Lion, Muon (Newton-Schulz orthogonalization), Schedule-Free AdamW. Schedulers: cosine, linear, WSD, constant, REX, none.
- Full parallelism stack: FSDP2 via `fully_shard()`, tensor parallelism, pipeline parallelism, expert parallelism for MoE, DeviceMesh composition, and FP8 mixed precision hooks.
- Distributed checkpointing with exact resume: DCP sharded save and load, non-blocking async writes, auto-resume via the `latest` symlink. Full state restore including dataloader position and RNG.
- SLURM-friendly and resilient: SIGTERM / SIGUSR1 handlers for preemption, SLURM requeue-aware via `SLURM_RESTART_COUNT`, NaN detection with configurable action (warn, skip, raise), GPU and NCCL liveness probes. Multi-node via `srun` direct with InfiniBand auto-detection.
- Agent-ready plugin (v0.1): six skills (`cluster-config`, `smoke-test`, `slurm-launch`, `explain-architecture`, `add-optimizer`, `component-gaps`). Every skill gates on a `check_env.py` preflight that returns actionable fix lines instead of silent failure.
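To make the resilience bullet concrete, here is a minimal standard-library sketch of the three ideas it names: flagging SIGTERM / SIGUSR1 for a clean stop, reading `SLURM_RESTART_COUNT` to detect a requeue, and applying a configurable action to a non-finite loss. The names `PreemptionGuard`, `restart_count`, and `check_loss` are hypothetical and chosen for illustration; they are not taken from the KempnerForge source, whose actual implementation may differ.

```python
import math
import os
import signal


class PreemptionGuard:
    """Hypothetical helper: records SIGTERM / SIGUSR1 so the loop can stop cleanly."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGUSR1)):
        self.stop_requested = False
        self.received = None
        for sig in signals:
            # signal.signal must be called from the main thread.
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the training loop performs the checkpoint write.
        self.stop_requested = True
        self.received = signal.Signals(signum).name


def restart_count(env=os.environ):
    """Times SLURM has requeued this job; 0 when the variable is unset."""
    return int(env.get("SLURM_RESTART_COUNT", "0"))


def check_loss(loss, action="warn"):
    """Return True if the step should be applied, False if it should be skipped.

    Assumption: "NaN detection" is treated here as any non-finite loss,
    so +inf and -inf are caught as well as NaN.
    """
    if math.isfinite(loss):
        return True
    if action == "raise":
        raise FloatingPointError(f"non-finite loss: {loss}")
    if action == "skip":
        return False
    if action == "warn":
        print(f"warning: non-finite loss {loss}; continuing")
        return True
    raise ValueError(f"unknown NaN action: {action!r}")
```

In a training loop, one would typically construct the guard once, call `check_loss` after each forward pass, and test `guard.stop_requested` at the end of each step to decide whether to save a checkpoint and exit. A nonzero `restart_count()` suggests the job was requeued and should resume from the most recent checkpoint.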
Other Changes
- Benchmarks: `bench_forward.py`, `bench_moe.py`, `bench_optimizer.py`, plus MFU scaling (dense up to 70B, MoE up to 32 GPUs with EP) and packed-MoE benchmarks.
- Documentation: Sphinx site under `docs/` with architecture walkthrough, how-to guides, configuration reference, subsystem index, and a dedicated "Agent Integration" section.
- Tests: 852 unit tests, integration tests for checkpoint round-trip and data resumption, multi-GPU distributed tests via `torchrun`, opt-in e2e tests (4 GPUs, up to 7B).
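For readers unfamiliar with the MFU (model FLOPs utilization) figure mentioned under Benchmarks, the sketch below shows one widely used approximation: roughly 6 FLOPs per parameter per token for a combined forward and backward pass, divided by the aggregate peak throughput of the GPUs. The release notes do not say how KempnerForge's benchmarks compute MFU, so the function name `estimate_mfu` and the formula are assumptions for illustration only.

```python
def estimate_mfu(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Approximate MFU for a dense decoder-only model (hypothetical helper).

    Assumes ~6 FLOPs per parameter per token (forward + backward) and
    ignores attention FLOPs. For an MoE model, n_params should be the
    number of parameters active per token, not the total.
    """
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = n_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    mfu = estimate_mfu(
        n_params=7e9,
        tokens_per_second=40_000,
        n_gpus=4,
        peak_flops_per_gpu=989e12,  # assumed per-GPU peak; check your hardware spec
    )
    print(f"estimated MFU: {mfu:.1%}")
```

The value passed as `peak_flops_per_gpu` depends on the accelerator and the precision in use, so it should be taken from the vendor's datasheet rather than from this example.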
Install / Upgrade
git clone https://github.com/KempnerInstitute/KempnerForge.git
cd KempnerForge
uv sync
# Install the agent plugin inside your Claude session (optional):
/plugin marketplace add /abs/path/to/KempnerForge
/plugin install kempnerforge@kempnerforge
/reload-plugins
/kempnerforge:cluster-config