--override-tensor exps=CPU causes a performance regression on Vulkan (AMD) — opposite of reported CUDA behavior #24846

aivisionslab-studios · 2026-06-20T22:44:50Z


Testing MoE models (Qwen3.5-35B-A3B, Q4_K_M/Q6_K) on an RX 580 8GB (Polaris/gfx803) via the Vulkan backend, on both Windows and Linux (Mesa RADV).

--override-tensor exps=CPU is documented as a CUDA optimization — keeping MoE expert tensors off the GPU to save VRAM/bandwidth on Nvidia setups. On Vulkan here it does the opposite: consistent regression in both environments.
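
For anyone unfamiliar with the flag, here is a rough sketch of what `exps=CPU` selects. It assumes the part before `=` is a regex searched against tensor names and the part after is the target buffer type, and that expert tensors follow the usual GGUF `ffn_*_exps` naming. The tensor names and helper functions below are illustrative, not dumped from the actual model or taken from llama.cpp:

```python
import re

# Assumed example tensor names in the common GGUF MoE naming style
# (not read from the Qwen3.5-35B-A3B file used in this post).
tensor_names = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_inp.weight",   # router weights, not an expert tensor
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]


def split_override(spec):
    """Split a 'pattern=buffer_type' spec into its two halves (hypothetical helper)."""
    pattern, buffer_type = spec.split("=", 1)
    return pattern, buffer_type


def overridden(names, spec):
    """Return the tensor names whose name contains a match for the pattern."""
    pattern, _ = split_override(spec)
    return [name for name in names if re.search(pattern, name)]


moved_to_cpu = overridden(tensor_names, "exps=CPU")
print(moved_to_cpu)
```

Under those assumptions only the three `*_exps` tensors per layer are redirected; attention and router weights stay wherever `-ngl` put them.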

| Platform | Baseline (ngl set) | With `exps=CPU` | Delta |
| --- | --- | --- | --- |
| Windows | 7.62 tok/s (-ngl 10) | 6.92 tok/s | -9% |
| Linux (RADV) | 5.18 tok/s (-ngl 20) | 4.62 tok/s | -11% |
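
The Delta column is the relative change against the baseline, rounded to a whole percent. A minimal check, using only the tok/s figures from the table (the function and variable names are just for illustration):

```python
def relative_delta_percent(baseline, variant):
    """Relative change of variant versus baseline, in percent."""
    return (variant - baseline) / baseline * 100.0


# (baseline tok/s, tok/s with exps=CPU), copied from the table above
results = {
    "Windows": (7.62, 6.92),
    "Linux (RADV)": (5.18, 4.62),
}

deltas = {
    platform: round(relative_delta_percent(base, with_override))
    for platform, (base, with_override) in results.items()
}
print(deltas)
```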

Setup: Xeon E5-2690 v3, 32GB DDR4 ECC, llama.cpp built with -DGGML_VULKAN=ON, no ROCm/HIP anywhere in the stack.

My read: on Vulkan, keeping the expert tensors on the CPU seems to add PCIe/transfer overhead that CUDA's memory model doesn't incur in the same way, but I don't have visibility into why CUDA benefits while Vulkan doesn't. Does anyone with more backend-internals knowledge know whether this is expected, or is it worth filing as an actual issue?
