Misc. bug: Inconsistent Flash Attention auto mode in llama-bench

### Name and Version

./build/bin/llama-cli --version
version: 9784 (8be759e6f)
built with GNU 13.3.0 for Linux x86_64

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-bench

### Command line

```shell
./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev Vulkan0 -ctk q8_0 -ctv q8_0 -v 2>&1 | grep Flash
sched_reserve: Flash Attention was auto, set to enabled

./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev ROCm0 -ctk q8_0 -ctv q8_0 -v 2>&1 | grep Flash
sched_reserve: Flash Attention was auto, set to enabled

./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev Vulkan0 -ctk q8_0 -ctv q5_1 -v 2>&1 | grep Flash
sched_reserve: Flash Attention was auto, set to enabled

./build/bin/llama-bench --offline -m models/gemma/gemma-4-26B-A4B-it-qat-GGUF/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -dev ROCm0 -ctk q8_0 -ctv q5_1 -v 2>&1 | grep Flash
sched_reserve: layer 0 is assigned to device ROCm0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
sched_reserve: Flash Attention was auto, set to disabled
llama_init_from_model: failed to initialize the context: quantized V cache was requested, but this requires Flash Attention
```
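The four runs above can be folded into a small matrix script. This is only a sketch: `fa_status` and `fa_matrix` are hypothetical helper names, the device names and cache types are the ones from this report, and the parsing assumes the `Flash Attention was auto, set to ...` log line keeps the wording shown above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are made up for this report, not part of llama.cpp.

# Read llama-bench verbose output on stdin and print the auto-mode decision:
# "enabled", "disabled", or "unknown" if the log line is missing.
fa_status() {
    local line
    line=$(grep 'Flash Attention was auto, set to' || true)
    case "$line" in
        *'set to enabled'*)  echo enabled ;;
        *'set to disabled'*) echo disabled ;;
        *)                   echo unknown ;;
    esac
}

# Run the bench for every device / V cache type pair and print one line each.
# $1 = path to llama-bench, $2 = path to the model file.
fa_matrix() {
    local bench=$1 model=$2 dev ctv
    for dev in Vulkan0 ROCm0; do
        for ctv in q8_0 q5_1 q5_0; do
            printf '%s ctv=%s: ' "$dev" "$ctv"
            "$bench" --offline -m "$model" -dev "$dev" \
                -ctk q8_0 -ctv "$ctv" -v 2>&1 | fa_status
        done
    done
}
```

Called as `fa_matrix ./build/bin/llama-bench <model.gguf>`, this prints one line per combination, which makes the Vulkan/ROCm difference easy to see at a glance.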

### Problem description & steps to reproduce

llama.cpp was compiled with ROCm/HIP and Vulkan support, and the benchmarks were run on both backends (see the command line above). A Gemma 4 Q4_K model was used, but the same problem shows up with Qwen3.5-9B in the Q8_K variant.

It seems that only Q8_0 KV cache quantization works properly on the ROCm backend. Q5_1, Q5_0 and Q4 all fail to create a context because Flash Attention ends up disabled in auto mode. Forcing it on with the `-fa on` argument lets the bench run, but:
- The output log does not mention Flash Attention being enabled
- Performance drops sharply: pp512 drops by about 90%, and tg128 drops by about 25%
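For anyone comparing their own runs, the relative drop can be computed from two t/s figures in the llama-bench table. The helper below is a hypothetical sketch (`pct_drop` is not a llama.cpp tool), and the numbers in the usage line are placeholders, not measurements from this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage drop from a baseline t/s value to a new one.
# $1 = baseline tokens/s (e.g. the q8_0 run), $2 = tokens/s of the slower run.
pct_drop() {
    awk -v b="$1" -v c="$2" 'BEGIN { printf "%.1f\n", (b - c) / b * 100 }'
}

# Placeholder values for illustration only.
pct_drop 1000 100
```

Feeding it the pp512 and tg128 values from a q8_0 run and a `-fa on` q5_1 run gives figures directly comparable to the 90% and 25% mentioned above.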

I couldn't find any explanation for this difference between the Vulkan and ROCm backends, so I assume it's a bug. If some quantization formats are simply not (yet) implemented in ROCm, then perhaps a warning message could be printed to avoid further confusion.

The verbose log without grep is too big, so I'm not attaching it to keep this clean; I can provide it if required. See the command line section for the brief output.

### First Bad Commit

No idea, I have not tried benchmarks with KV quantization in the past :(

### Relevant log output

<details>
<summary>Logs</summary>


```console

```
</details>

Misc. bug: Inconsistent Flash Attention auto mode in llama-bench #25007
