
Misc. bug: Inconsistent Flash Attention auto mode in llama-bench #25007

Description

@Only8Bits

Name and Version

./build/bin/llama-cli --version
version: 9784 (8be759e)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev Vulkan0 -ctk q8_0 -ctv q8_0 -v 2>&1 | grep Flash
sched_reserve: Flash Attention was auto, set to enabled

./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev ROCm0 -ctk q8_0 -ctv q8_0 -v 2>&1 | grep Flash
sched_reserve: Flash Attention was auto, set to enabled

./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev Vulkan0 -ctk q8_0 -ctv q5_1 -v 2>&1 | grep Flash
sched_reserve: Flash Attention was auto, set to enabled

./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev ROCm0 -ctk q8_0 -ctv q5_1 -v 2>&1 | grep Flash
sched_reserve: layer 0 is assigned to device ROCm0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
sched_reserve: Flash Attention was auto, set to disabled
llama_init_from_model: failed to initialize the context: quantized V cache was requested, but this requires Flash Attention

Problem description & steps to reproduce

llama.cpp was compiled with ROCm/HIP and Vulkan support, and the benchmarks were run on both backends (see the command line above). A Gemma 4 Q4_K model was used, but the same problem shows up with Qwen3.5-9B in the Q8_K variant.

It seems that only Q8_0 KV cache quantization works properly on the ROCm backend. With Q5_1, Q5_0 and Q4, context creation fails because Flash Attention ends up disabled in auto mode. Forcing it on with the "-fa on" argument lets the bench run, but:

  • There is no mention of Flash Attention being enabled in the output log
  • A huge performance drop is observed: pp512 drops 90%, and tg128 also suffers, though less severely, with a drop of about 25%

I couldn't find any explanation for this difference between the Vulkan and ROCm backends, so I assume it's a bug. If it's simply a case of some quantization formats not (yet) being implemented in ROCm, then perhaps a warning message could be printed to avoid further confusion.

The verbose log without grep is too big, so I'm not attaching it to keep this report cleaner; I can provide it if required. See the command line section for the brief output.
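
For anyone who wants to check other combinations, here is a minimal sketch of how the runs from the command line section could be swept in a loop. The function names `fa_state` and `sweep` and the `BENCH` variable are hypothetical helpers, not part of llama.cpp; the only llama-bench flags used are the ones from the commands above, and the sweep covers just the V cache types named explicitly in this report. `fa_state` only matches the "Flash Attention was auto, set to ..." log lines quoted above, so it prints "unknown" if the log wording differs.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads llama-bench verbose output on stdin and prints
# how the "auto" Flash Attention setting was resolved
# (enabled, disabled, or unknown if no matching log line was seen).
fa_state() {
    local line state="unknown"
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            *"Flash Attention was auto, set to enabled"*)  state="enabled" ;;
            *"Flash Attention was auto, set to disabled"*) state="disabled" ;;
        esac
    done
    printf '%s\n' "$state"
}

# Hypothetical helper: runs the benchmark for each backend and V cache type
# from this report and prints one "<device> <ctv> <state>" line per run.
# $1 is the path to the model file; BENCH may override the binary location.
sweep() {
    local bench="${BENCH:-./build/bin/llama-bench}" model="$1" dev ctv
    for dev in Vulkan0 ROCm0; do
        for ctv in q8_0 q5_1 q5_0; do
            printf '%s %s ' "$dev" "$ctv"
            "$bench" --offline -m "$model" -dev "$dev" -ctk q8_0 -ctv "$ctv" -v 2>&1 | fa_state
        done
    done
}
```

After sourcing the file, `sweep path/to/model.gguf` prints one line per device and cache type, which makes the Vulkan versus ROCm difference easy to see at a glance.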

First Bad Commit

No idea, I have not tried benchmarks with KV quantization in the past :(
