ggml: add ROCmFP4 CPU quantization (experimental Q4_0_ROCMFP4 / _FAST) by TheTom · Pull Request #170 · TheTom/llama-cpp-turboquant

TheTom · 2026-06-06T02:07:38Z

Summary

Adds two experimental 4-bit FP4 weight tensor types aimed at AMD/ROCm hardware. This PR covers the CPU/reference path only:

Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw (one UE4M3 scale per 16 weights).
Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.
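The block sizes and bpw figures above can be checked with a little arithmetic. The sketch below assumes each scale occupies one byte and that the 32 4-bit values are packed two per byte; the PR states only the totals, so the split between payload and scale bytes is an inference, and `bits_per_weight` is a hypothetical helper name.

```python
VALUES_PER_BLOCK = 32
# Assumption: 4-bit values are packed two per byte, giving 16 payload bytes.
PACKED_BYTES = VALUES_PER_BLOCK * 4 // 8


def bits_per_weight(n_scale_bytes):
    """Return (block size in bytes, bits per weight) for a given scale count.

    Assumes every scale is stored in exactly one byte.
    """
    block_bytes = PACKED_BYTES + n_scale_bytes
    return block_bytes, block_bytes * 8 / VALUES_PER_BLOCK


dual = bits_per_weight(2)    # Q4_0_ROCMFP4: one scale per 16 weights
single = bits_per_weight(1)  # Q4_0_ROCMFP4_FAST: one scale per 32 weights
print(dual, single)
```

Under these assumptions the dual-scale layout comes out at 18 bytes and 4.50 bpw, and the single-scale layout at 17 bytes and 4.25 bpw, matching the figures in the list.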

The quantizer brute-forces all finite UE4M3 scale candidates per block and keeps the lowest-error assignment. The code lives in a self-contained ggml/rocmfp4/ directory with its own E4M3 logic, so it does not touch the existing turbo/TQ or MXFP4/NVFP4 paths.
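A minimal sketch of the brute-force scale search described above, not the code in ggml/rocmfp4/. It rests on several assumptions the PR does not spell out: that FP4 values use the E2M1 grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), that UE4M3 follows the E4M3FN convention with bias 7 where code 0x7F is NaN, and that the error metric is plain squared error without imatrix weighting. All function and constant names are hypothetical.

```python
# Assumed FP4 (E2M1) magnitudes; the signed grid has 15 distinct values.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * m for m in FP4_MAGNITUDES for s in (1.0, -1.0)})


def ue4m3_to_float(code):
    """Decode an unsigned E4M3 code (4 exponent bits, 3 mantissa bits, bias 7)."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:
        return man * 2.0 ** -9  # subnormal: (man / 8) * 2**-6
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


# Assumption: codes 0x00..0x7E are the finite candidates, 0x7F is NaN.
FINITE_SCALES = [(code, ue4m3_to_float(code)) for code in range(0x7F)]


def quantize_group(weights):
    """Try every finite scale and keep the one with the lowest squared error.

    Returns (error, scale_code, fp4_values).
    """
    best = None
    for code, scale in FINITE_SCALES:
        # Round each weight to the nearest representable value at this scale.
        q = [min(FP4_VALUES, key=lambda v: abs(w - v * scale)) for w in weights]
        err = sum((w - v * scale) ** 2 for w, v in zip(weights, q))
        if best is None or err < best[0]:
            best = (err, code, q)
    return best


# Weights that are exactly representable with scale 0.25.
err, code, q = quantize_group([v * 0.25 for v in FP4_VALUES])
print(err, code, ue4m3_to_float(code))
```

In this sketch the dual-scale layout would call `quantize_group` once per 16 weights and the single-scale layout once per 32. The search is exhaustive, so its cost grows with the number of scale candidates times the group size, which is acceptable for an offline quantizer.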

Why

It complements our existing 4-bit weight format (TQ4_1S, software WHT-rotated) with a hardware-native FP4 layout that maps onto AMD's FP4 matmul. It is a good fit for our ROCm/HIP users and Strix Halo / MI300 class hardware. This PR is the portable CPU/reference foundation; GPU acceleration is intended as follow-up work.

Integration

Type IDs are placed after the fork's turbo/TQ types so nothing collides:

| symbol | `GGML_TYPE` | `LLAMA_FTYPE` |
| --- | --- | --- |
| `Q4_0_ROCMFP4` | 47 | 41 |
| `Q4_0_ROCMFP4_FAST` | 48 | 42 |

These IDs are stored in the GGUF tensor type field, so files produced by this branch can only be read back by builds that use the same assignment (documented in ggml/rocmfp4/README.md).
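To illustrate why the assignment matters, here is a small sketch of a reader resolving the numeric tensor type field against its own table. The IDs come from the table above; the `resolve_tensor_type` helper and the dictionary are hypothetical and do not mirror the actual ggml loader.

```python
# GGML_TYPE IDs as listed in this PR.
FORK_GGML_TYPES = {
    47: "Q4_0_ROCMFP4",
    48: "Q4_0_ROCMFP4_FAST",
}


def resolve_tensor_type(type_id, known_types):
    """Map the numeric type field of a tensor to a type name.

    Raises ValueError when the reading build has no entry for the ID.
    """
    try:
        return known_types[type_id]
    except KeyError:
        raise ValueError(f"unknown tensor type id {type_id}") from None


print(resolve_tensor_type(47, FORK_GGML_TYPES))
```

A build whose table lacks these entries would fail the lookup, and one that assigns the same number to a different type would misinterpret the tensor data, which is why the README records the assignment.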

Coverage

Full CPU lifecycle in this change: GGUF type registration, CPU reference quantize/dequantize, row validation, llama-quantize integration, imatrix through the existing path, and model load. No Python/converter changes are needed because llama-quantize is the producer.
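As a rough picture of what row validation could look like, the sketch below checks that a row is a whole number of blocks and that no scale byte holds a non-finite code. The byte layout is an assumption made for illustration (scale bytes first, then 16 packed payload bytes), as is treating any code above 0x7E as invalid; the real layout and checks live in ggml/rocmfp4/ and may differ. `validate_row` is a hypothetical name.

```python
# Block sizes from the PR; scale counts follow from the dual/single layouts.
BLOCK_BYTES = {"Q4_0_ROCMFP4": 18, "Q4_0_ROCMFP4_FAST": 17}
N_SCALES = {"Q4_0_ROCMFP4": 2, "Q4_0_ROCMFP4_FAST": 1}

# Assumption: UE4M3 uses 7 bits, with 0x7F reserved for NaN.
MAX_FINITE_SCALE_CODE = 0x7E


def validate_row(data, type_name):
    """Return True if the row is whole blocks with finite scale codes.

    Assumes the scale bytes sit at the start of each block.
    """
    block = BLOCK_BYTES[type_name]
    if len(data) % block != 0:
        return False
    for off in range(0, len(data), block):
        for scale_byte in data[off:off + N_SCALES[type_name]]:
            if scale_byte > MAX_FINITE_SCALE_CODE:
                return False
    return True


print(validate_row(bytes(18), "Q4_0_ROCMFP4"))
print(validate_row(bytes([0x7F]) + bytes(17), "Q4_0_ROCMFP4"))
```

Rejecting bad scale bytes at load time keeps a corrupted or mismatched file from producing NaN weights silently during inference.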

Testing

Builds clean on Apple Silicon (Metal config), 0 errors.
llama-quantize registers both types:

  41  or  Q4_0_ROCMFP4 :  4.50 bpw ROCmFP4 dual-scale layout
  42  or  Q4_0_ROCMFP4_FAST :  4.25 bpw ROCmFP4 single-scale layout

Follow-up: a quantize round-trip correctness test, and the GPU accel path.

Original quantizer implementation by caf (preserved as the commit author).

Adds two experimental 4-bit ROCmFP4 GGUF tensor types:

- Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw.
- Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.

CPU/reference portion only: GGUF tensor type registration, CPU reference quantization/dequantization, row validation for ROCmFP4 scale bytes, llama-quantize integration, and imatrix support through the existing quantization path.

Existing quantization modes and MXFP4/NVFP4 behavior are unchanged.

Type IDs are assigned after the fork's turbo/TQ types: GGML_TYPE 47/48, LLAMA_FTYPE 41/42.

caf and others added 2 commits June 5, 2026 21:06

docs(rocmfp4): note fork type IDs and CPU-reference status

fb32d1a

github-actions bot added the ggml and examples labels Jun 6, 2026

TheTom wants to merge 2 commits into feature/turboquant-kv-cache from feat/rocmfp4-cpu
