ggml: add ROCmFP4 CPU quantization (experimental Q4_0_ROCMFP4 / _FAST) #170

Open: TheTom wants to merge 2 commits into feature/turboquant-kv-cache from feat/rocmfp4-cpu

TheTom (Owner) commented Jun 6, 2026

Summary

Adds two experimental 4-bit FP4 weight tensor types aimed at AMD/ROCm hardware. This PR covers the CPU/reference path only:

  • Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw (one UE4M3 scale per 16 weights).
  • Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.
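The byte counts and bpw figures above follow from 32 packed 4-bit codes (16 bytes) plus one or two scale bytes. A minimal sketch of that arithmetic is below; the struct names, field names, and field order are hypothetical and are not taken from ggml/rocmfp4/.

```c
#include <stddef.h>
#include <stdint.h>

#define ROCMFP4_BLOCK 32  /* weights per block, as stated in the summary */

/* Hypothetical dual-scale block: one scale byte per 16 weights.
 * Field names and order are assumptions for illustration only. */
typedef struct {
    uint8_t scales[2];              /* two UE4M3 scale bytes */
    uint8_t qs[ROCMFP4_BLOCK / 2];  /* 32 four-bit codes, two per byte */
} sketch_block_rocmfp4;

/* Hypothetical single-scale block: one scale byte per 32 weights. */
typedef struct {
    uint8_t scale;                  /* one UE4M3 scale byte */
    uint8_t qs[ROCMFP4_BLOCK / 2];  /* 32 four-bit codes, two per byte */
} sketch_block_rocmfp4_fast;

/* Bits per weight for a block of the given size in bytes. */
static double bits_per_weight(size_t block_bytes) {
    return (double)(block_bytes * 8) / ROCMFP4_BLOCK;
}
```

With these definitions, `bits_per_weight(sizeof(sketch_block_rocmfp4))` gives 18 * 8 / 32 = 4.50 and the `_fast` variant gives 17 * 8 / 32 = 4.25, matching the two bullets.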

The quantizer brute-forces all finite UE4M3 scale candidates per block and keeps the lowest-error assignment. The code lives in a self-contained ggml/rocmfp4/ directory with its own E4M3 logic, so it does not touch the existing turbo/TQ or MXFP4/NVFP4 paths.
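The brute-force search can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in ggml/rocmfp4/: it assumes the 4-bit elements are E2M1 codes (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), that UE4M3 uses bias 7 with byte 0x7F reserved as non-finite, and that the error metric is plain squared error. All function names are hypothetical.

```c
#include <float.h>
#include <math.h>
#include <stdint.h>

/* Magnitudes of a 4-bit E2M1 code (assumed element format). */
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

/* Decode an unsigned E4M3 byte: 4 exponent bits, 3 mantissa bits, bias 7 (assumed). */
static float ue4m3_to_float(uint8_t b) {
    int e = (b >> 3) & 0xF;
    int m = b & 0x7;
    if (e == 0) return ldexpf((float)m, -9);      /* subnormal: (m/8) * 2^-6 */
    return ldexpf(1.0f + (float)m / 8.0f, e - 7); /* normal: (1 + m/8) * 2^(e-7) */
}

/* Nearest E2M1 code for v: bit 3 is the sign, the low 3 bits index kE2M1. */
static uint8_t nearest_e2m1(float v) {
    float a = fabsf(v);
    int best = 0;
    for (int i = 1; i < 8; i++)
        if (fabsf(a - kE2M1[i]) < fabsf(a - kE2M1[best])) best = i;
    return (uint8_t)((v < 0.0f ? 8 : 0) | best);
}

/* Decode a signed E2M1 code back to a float. */
static float e2m1_to_float(uint8_t c) {
    float m = kE2M1[c & 7];
    return (c & 8) ? -m : m;
}

/* Try every finite scale byte (0x00..0x7E under the assumption above), keep the
 * one with the lowest squared error, and write the matching codes to `codes`.
 * Ties keep the smallest scale byte. */
static uint8_t search_scale(const float *x, int n, uint8_t *codes) {
    uint8_t best_s = 0;
    float best_err = FLT_MAX;
    for (int s = 0; s < 0x7F; s++) {
        float d = ue4m3_to_float((uint8_t)s);
        float err = 0.0f;
        for (int i = 0; i < n; i++) {
            float q = (d > 0.0f) ? e2m1_to_float(nearest_e2m1(x[i] / d)) * d : 0.0f;
            float diff = x[i] - q;
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; best_s = (uint8_t)s; }
    }
    float d = ue4m3_to_float(best_s);
    for (int i = 0; i < n; i++)
        codes[i] = (d > 0.0f) ? nearest_e2m1(x[i] / d) : 0;
    return best_s;
}
```

In this sketch the dual-scale layout would call `search_scale` once per 16-weight half, and the single-scale layout once per 32-weight block. With 127 candidates per call the search is exhaustive but cheap enough for an offline quantizer.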

Why

It complements our existing 4-bit weight format (TQ4_1S, software WHT-rotated) with a hardware-native FP4 layout that maps onto AMD's FP4 matmul. Good fit for our ROCm/HIP users and Strix Halo / MI300 class hardware. This PR is the portable CPU/reference foundation; GPU acceleration is intended as follow-up work.

Integration

Type IDs are placed after the fork's turbo/TQ types so nothing collides:

symbol             GGML_TYPE  LLAMA_FTYPE
Q4_0_ROCMFP4       47         41
Q4_0_ROCMFP4_FAST  48         42

These IDs are stored in the GGUF tensor type field, so files produced here can only be read back by builds that use the same assignment (documented in ggml/rocmfp4/README.md).
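To illustrate why the assignment matters, here is a rough sketch of how a reader might turn the stored type ID into a row size. The enum and function names are hypothetical; only the numeric IDs and block sizes come from this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the two GGML_TYPE values from the table above. */
enum sketch_type {
    SKETCH_TYPE_Q4_0_ROCMFP4      = 47,
    SKETCH_TYPE_Q4_0_ROCMFP4_FAST = 48,
};

/* Bytes needed for a row of n_elems weights, or 0 if the type ID is not
 * one of the two ROCmFP4 types or n_elems is not a multiple of 32. */
static size_t sketch_row_size(uint32_t type_id, size_t n_elems) {
    size_t block_bytes;
    switch (type_id) {
        case SKETCH_TYPE_Q4_0_ROCMFP4:      block_bytes = 18; break;
        case SKETCH_TYPE_Q4_0_ROCMFP4_FAST: block_bytes = 17; break;
        default: return 0;
    }
    if (n_elems % 32 != 0) return 0;
    return n_elems / 32 * block_bytes;
}
```

A build that assigns 47 or 48 to a different type would compute a different row size for the same file, which is why the README pins the mapping.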

Coverage

Full CPU lifecycle in this change: GGUF type registration, CPU reference quantize/dequantize, row validation, llama-quantize integration, imatrix through the existing path, and model load. No Python/converter changes are needed because llama-quantize is the producer.
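As an illustration of the row validation step, a check on scale bytes might look roughly like the sketch below. It assumes the single-scale layout stores its scale byte first in each 17-byte block and that a finite UE4M3 byte has the top bit clear and is not 0x7F; both are assumptions, and the names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FAST_BLOCK_BYTES 17  /* single-scale block size from the summary */

/* Assumed finiteness rule: top bit unused, 0x7F reserved as non-finite. */
static bool sketch_scale_is_finite(uint8_t s) {
    return (s & 0x80) == 0 && s != 0x7F;
}

/* Validate a row of single-scale blocks, assuming the scale byte is at
 * offset 0 of every block (hypothetical layout). */
static bool sketch_validate_row_fast(const uint8_t *row, size_t n_bytes) {
    if (n_bytes % SKETCH_FAST_BLOCK_BYTES != 0) return false;
    for (size_t off = 0; off < n_bytes; off += SKETCH_FAST_BLOCK_BYTES)
        if (!sketch_scale_is_finite(row[off])) return false;
    return true;
}
```

A check like this lets model load reject corrupted or mismatched tensors before any dequantization runs.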

Testing

  • Builds cleanly on Apple Silicon (Metal config) with 0 errors.
  • llama-quantize registers both types:
  41  or  Q4_0_ROCMFP4 :  4.50 bpw ROCmFP4 dual-scale layout
  42  or  Q4_0_ROCMFP4_FAST :  4.25 bpw ROCmFP4 single-scale layout
  • Follow-up work: a quantize round-trip correctness test and the GPU acceleration path.

Original quantizer implementation by caf (preserved as the commit author).

caf and others added 2 commits June 5, 2026 21:06
Adds two experimental 4-bit ROCmFP4 GGUF tensor types:
- Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw.
- Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.

CPU/reference portion only: GGUF tensor type registration, CPU reference
quantization/dequantization, row validation for ROCmFP4 scale bytes,
llama-quantize integration, and imatrix support through the existing
quantization path. Existing quantization modes and MXFP4/NVFP4 behavior
are unchanged.

Type IDs are assigned after the fork's turbo/TQ types: GGML_TYPE 47/48,
LLAMA_FTYPE 41/42.