ggml: add ROCmFP4 CPU quantization (experimental Q4_0_ROCMFP4 / _FAST) #170
Open
TheTom wants to merge 2 commits into
Adds two experimental 4-bit ROCmFP4 GGUF tensor types:
- Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw.
- Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.

CPU/reference portion only: GGUF tensor type registration, CPU reference quantization/dequantization, row validation for ROCmFP4 scale bytes, llama-quantize integration, and imatrix support through the existing quantization path. Existing quantization modes and MXFP4/NVFP4 behavior are unchanged. Type IDs are assigned after the fork's turbo/TQ types: GGML_TYPE 47/48, LLAMA_FTYPE 41/42.
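The byte counts and bpw figures follow directly from the block shape: 32 four-bit codes take 16 bytes, plus one or two scale bytes. The sketch below illustrates that arithmetic only. The struct names, field names, and field order are hypothetical and are not taken from the PR's sources.

```c
#include <stddef.h>
#include <stdint.h>

#define QK_ROCMFP4_SKETCH 32  /* values per block, as stated in the description */

/* Hypothetical dual-scale layout: names and field order are assumptions. */
typedef struct {
    uint8_t scales[2];                  /* one scale byte per 16 weights */
    uint8_t qs[QK_ROCMFP4_SKETCH / 2];  /* 32 four-bit codes, two per byte */
} block_rocmfp4_sketch;

/* Hypothetical single-scale layout: names and field order are assumptions. */
typedef struct {
    uint8_t scale;                      /* one scale byte for all 32 weights */
    uint8_t qs[QK_ROCMFP4_SKETCH / 2];
} block_rocmfp4_fast_sketch;

/* All fields are bytes, so no padding is expected. */
_Static_assert(sizeof(block_rocmfp4_sketch) == 18, "dual-scale block is 18 bytes");
_Static_assert(sizeof(block_rocmfp4_fast_sketch) == 17, "single-scale block is 17 bytes");

/* Bits per weight = 8 * bytes per block / values per block. */
static double bits_per_weight(size_t block_bytes, size_t block_values) {
    return 8.0 * (double)block_bytes / (double)block_values;
}
```

With these sizes, `bits_per_weight(18, 32)` gives 4.5 and `bits_per_weight(17, 32)` gives 4.25, matching the figures quoted for the two types.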
Summary
Adds two experimental 4-bit FP4 weight tensor types aimed at AMD/ROCm hardware, CPU/reference path:
- Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw (one UE4M3 scale per 16 weights).
- Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.

The quantizer brute-forces all finite UE4M3 scale candidates per block and keeps the lowest-error assignment. It lives in a self-contained ggml/rocmfp4/ directory with its own E4M3 logic, so it does not touch the existing turbo/TQ or MXFP4/NVFP4 paths.
Why
It complements our existing 4-bit weight format (TQ4_1S, software WHT-rotated) with a hardware-native FP4 layout that maps onto AMD's FP4 matmul. Good fit for our ROCm/HIP users and Strix Halo / MI300 class hardware. This PR is the portable CPU/reference foundation; GPU acceleration is intended as follow-up work.
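The brute-force scale search mentioned in the Summary can be sketched roughly as follows. This is a minimal illustration, not the PR's quantizer. It assumes the 4-bit codes are E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), that UE4M3 uses a bias of 7 with subnormals, that codes 0x00 through 0x7E are the finite candidates, and that the error metric is plain squared error. All function names are hypothetical.

```c
#include <float.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed E2M1 magnitudes; the sign is handled separately. */
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

static float absf(float v) { return v < 0.0f ? -v : v; }

/* 2^k for small integer k, written out to avoid a libm dependency. */
static float pow2i(int k) {
    float r = 1.0f;
    for (; k > 0; --k) r *= 2.0f;
    for (; k < 0; ++k) r *= 0.5f;
    return r;
}

/* Assumed UE4M3 decode: 4 exponent bits, 3 mantissa bits, bias 7, no sign. */
static float ue4m3_to_float(uint8_t code) {
    int e = (code >> 3) & 0xF;
    int m = code & 0x7;
    if (e == 0) return ((float)m / 8.0f) * pow2i(-6);  /* subnormal range */
    return (1.0f + (float)m / 8.0f) * pow2i(e - 7);
}

/* Assumption: 0x7F (NaN in E4M3FN) and anything with the top bit set
 * are not usable as scales. */
static int ue4m3_is_finite(uint8_t code) { return code < 0x7F; }

/* Round v to the nearest signed E2M1 value (saturates at +/-6). */
static float nearest_e2m1(float v) {
    float a = absf(v), best = kE2M1[0];
    for (int i = 1; i < 8; ++i)
        if (absf(a - kE2M1[i]) < absf(a - best)) best = kE2M1[i];
    return v < 0.0f ? -best : best;
}

/* Try every finite scale code and keep the one with the lowest squared error. */
static uint8_t best_scale_code(const float *x, size_t n, float *out_err) {
    uint8_t best_code = 0;
    float best_err = FLT_MAX;
    for (int code = 0; code < 256; ++code) {
        if (!ue4m3_is_finite((uint8_t)code)) continue;
        float d = ue4m3_to_float((uint8_t)code);
        float err = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            float q = d > 0.0f ? nearest_e2m1(x[i] / d) : 0.0f;
            float diff = x[i] - d * q;
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; best_code = (uint8_t)code; }
    }
    if (out_err) *out_err = best_err;
    return best_code;
}
```

Under these assumptions there are only 127 candidate scales, so an exhaustive search per 16- or 32-value group stays cheap for an offline quantizer. A dual-scale block would call `best_scale_code` once per 16-weight half.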
Integration
Type IDs are placed after the fork's turbo/TQ types so nothing collides:

| Type | GGML_TYPE | LLAMA_FTYPE |
|---|---|---|
| Q4_0_ROCMFP4 | 47 | 41 |
| Q4_0_ROCMFP4_FAST | 48 | 42 |

These IDs are stored in the GGUF tensor type field, so files produced here are only readable by builds using the same assignment (documented in ggml/rocmfp4/README.md).
Coverage
Full CPU lifecycle in this change: GGUF type registration, CPU reference quantize/dequantize, row validation, llama-quantize integration, imatrix through the existing path, and model load. No Python/converter changes are needed; llama-quantize is the producer.
Testing
llama-quantize registers both types:

Original quantizer implementation by caf (preserved as the commit author).