Name and Version
./build/bin/llama-cli --version
version: 9784 (8be759e)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench
Command line
./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev Vulkan0 -ctk q8_0 -ctv q8_0 -v 2>&1 | grep Flash
sched_reserve: Flash Attention was auto, set to enabled
./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev ROCm0 -ctk q8_0 -ctv q8_0 -v 2>&1 | grep Flash
sched_reserve: Flash Attention was auto, set to enabled
./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev Vulkan0 -ctk q8_0 -ctv q5_1 -v 2>&1 | grep Flash
sched_reserve: Flash Attention was auto, set to enabled
./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev ROCm0 -ctk q8_0 -ctv q5_1 -v 2>&1 | grep Flash
sched_reserve: layer 0 is assigned to device ROCm0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
sched_reserve: Flash Attention was auto, set to disabled
llama_init_from_model: failed to initialize the context: quantized V cache was requested, but this requires Flash Attention
Problem description & steps to reproduce
llama.cpp was compiled with ROCm/HIP and Vulkan support, and benchmarks were run on both backends (see the command line above). A Gemma 4 Q4_K model was used, but the same problem manifests on Qwen3.5-9B in the Q8_K variant.
It seems that only Q8_0 quantization works properly on the ROCm backend. Q5_1, Q5_0 and Q4 all fail to create a context because Flash Attention ends up disabled in auto mode. Forcing it on with the "-fa on" argument lets the bench run, but:
- There is no mention of Flash Attention being enabled in the output log
- A huge performance drop is observed: pp512 drops 90%, and tg128 also suffers, though less severely, with about a 25% drop
I couldn't find any explanation for this difference between Vulkan and ROCm backends so I assume it's a bug. If it's simply a case of some quantization formats not (yet) being implemented in ROCm then perhaps a warning message could be output to avoid any further confusion.
The verbose log without grep is too big, so I'm not attaching it to keep this cleaner; I can provide it if required. See the command line section for brief output.
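For anyone who wants to reproduce the comparison in one pass, the runs from the command line section could be scripted roughly as follows. This is a minimal sketch, not the exact procedure used for this report: `fa_status` and `run_matrix` are hypothetical helper names, `BENCH` and `MODEL` are assumed paths to adjust for your setup, and the only llama-bench flags used are the ones already shown above. Q4 is left out because the exact type name is not stated here.

```shell
#!/bin/sh
# Sketch only: BENCH and MODEL are assumed paths, adjust them for your build.
BENCH="${BENCH:-./build/bin/llama-bench}"
MODEL="${MODEL:-models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf}"

# fa_status (hypothetical helper): read a verbose log on stdin and print the
# value from the first "Flash Attention was auto, set to <value>" line,
# or "unknown" if no such line is present.
fa_status() {
    sed -n 's/.*Flash Attention was auto, set to \([a-z]*\).*/\1/p' \
        | head -n 1 \
        | grep . || echo unknown
}

# run_matrix (hypothetical helper): one llama-bench run per device and
# V cache type, with the K cache fixed at q8_0 as in the commands above.
run_matrix() {
    for dev in Vulkan0 ROCm0; do
        for ctv in q8_0 q5_1 q5_0; do
            printf '%s\t%s\t' "$dev" "$ctv"
            "$BENCH" --offline -m "$MODEL" -dev "$dev" -ctk q8_0 -ctv "$ctv" -v 2>&1 | fa_status
        done
    done
}

# Run the full matrix only when asked to, e.g.: sh fa_matrix.sh run
if [ "${1:-}" = "run" ]; then
    run_matrix
fi
```

Saved under a name of your choice (for example `fa_matrix.sh`) and invoked with `run`, this prints one tab-separated line per device and V cache type. It only looks at the auto-mode decision line, so it says nothing about the performance drop seen with "-fa on".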
First Bad Commit
No idea, I have not tried benchmarks with KV quantization in the past :(
Relevant log output
Logs