Performance of llama.cpp with Vulkan #10879
Replies: 285 comments 460 replies
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04, Vulkan and CUDA installed from official APT packages.
build: 4da69d1 (4351)
vs CUDA on the same build/setup
build: 4da69d1 (4351)
-
Macbook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023) Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439)
NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
-
@netrunnereve Some of the tg results here are a little low; I think they might be debug builds. The cmake step (at least on Linux) might require
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
edit: retested both with the default batch size.
-
Here's something exotic: An AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.
build: 914a82d (4452)
-
Latest arch with
For the sake of consistency I run every bit in a script and also build every target from scratch (for some reason
kill -STOP -1
timeout 240s $COMMAND
kill -CONT -1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This bit seems to underutilise both GPU and CPU in real conditions based on
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan
Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Radeon PRO VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Multi GPU Vulkan
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Single GPU Rocm
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Multi GPU Rocm
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Layer split
build: 2739a71 (4461)
Row split
build: 2739a71 (4461)
Single GPU speed is decent, but multi GPU trails Rocm by a wide margin, especially with large models due to the lack of row split.
-
AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
I also think it would be interesting to add the flash attention results to the scoreboard (even if the support for it still isn't as mature as CUDA's).
-
I tried, but there was no output after about an hour (OK, it might have been 40 minutes). Anyway, I ran llama-cli for a sample eval...
Meanwhile OpenBLAS
-
❯ ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -dev Vulkan0
build: 0d0764d (8892)
-
Framework 13 running AMD Ryzen 5 7640U w/ Radeon 760M Graphics
./build/bin/llama-bench -m ~/Downloads/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
build: 187a456 (8910)
-
❯ ~/llama.cpp/build-vulkan/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: 1e9d771 (8768)
-
Yep, I'll play... big uplift in PP compared with the R9700 on the leaderboard, small increase in TG.
-
❯ ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -dev Vulkan0
build: d05fe1d (9010)
-
ggml_vulkan: 1 = Tesla P40 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 7b8443a (8966)
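The device and build lines that people paste in this thread follow a regular layout, so they are easy to turn into structured data when comparing entries. The following is a minimal sketch, not part of llama.cpp: the function names `parse_device_line` and `parse_build_line` are hypothetical, and the field layout is assumed from the lines quoted in this thread (it may change between builds).

```python
import re

# Assumed layout, taken from lines posted in this thread:
#   ggml_vulkan: <index> = <name> | key: value | key: value ...
DEVICE_RE = re.compile(r"^ggml_vulkan:\s*(\d+)\s*=\s*(.+)$")
# Assumed layout: build: <short git hash> (<build number>)
BUILD_RE = re.compile(r"^build:\s*([0-9a-f]+)\s*\((\d+)\)")


def parse_device_line(line):
    """Return a dict for a 'ggml_vulkan: N = ...' line, or None if it does not match."""
    match = DEVICE_RE.match(line.strip())
    if match is None:
        return None  # e.g. the 'Found N Vulkan devices:' header line
    index, rest = match.groups()
    name, *fields = [part.strip() for part in rest.split("|")]
    info = {"index": int(index), "name": name}
    for field in fields:
        key, _, value = field.partition(":")
        info[key.strip()] = value.strip()
    return info


def parse_build_line(line):
    """Return the commit hash and build number from a 'build: ...' line, or None."""
    match = BUILD_RE.match(line.strip())
    if match is None:
        return None
    return {"commit": match.group(1), "number": int(match.group(2))}


device = parse_device_line(
    "ggml_vulkan: 1 = Tesla P40 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | "
    "warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none"
)
print(device["name"], device["warp size"], device["matrix cores"])
print(parse_build_line("build: 7b8443a (8966)"))
```

Values are kept as strings because fields such as `matrix cores` are not numeric; convert them as needed.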
-
Interestingly I get wildly different scores between Windows and Linux on an Arc Pro B50
ggml_vulkan: 0 = Intel(R) Arc(TM) Pro B50 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
on Linux (Ubuntu 25.10) intel i7-8700
in Windows 11 - Ryzen 5700X
build: aa46bda (9437)
-
Vega 64 (undervolted)
Sapphire RX Vega 64 (reference blower-fan type)
Result
$ RADV_PERFTEST=nogttspill llama-bench --model ./llama-2-7b.Q4_0.gguf -ngl 100 -fa 1 -r 5 --delay 5
build: 5dcb711 (9464)
LACT profile:
[...]
profiles:
  Undervolt lowest MAX perf:
    gpus:
      1002:687F-1002:6B76-0000:2f:00.0:
        fan_control_enabled: true
        fan_control_settings:
          mode: static
          static_speed: 1.0
          temperature_key: edge
          interval_ms: 500
          curve:
            54: 0.0
            55: 0.05
            60: 0.2
            70: 0.7
            80: 0.9
            90: 1.0
          spindown_delay_ms: 10000
          change_threshold: 3
        power_cap: 330.0
        performance_level: manual
        gpu_vf_curve:
          5:
            voltage: 825
            clockspeed: 1400
          2:
            voltage: 803
          6:
            voltage: 850
            clockspeed: 1520
          7:
            voltage: 1050
            clockspeed: 1580
          4:
            voltage: 825
          1:
            voltage: 802
          0:
            voltage: 801
          3:
            voltage: 825
        mem_vf_curve:
          3:
            voltage: 980
            clockspeed: 1050
        power_profile_mode_index: 5
        power_states:
          core_clock:
          - 7
          memory_clock:
          - 3
Vega 64 (default profile)
ggml_vulkan: Found 1 Vulkan devices:
build: 5dcb711 (9464)
-
$> uname -a
$> llama-bench -fa 0,1 -ngl 666 -m /zfast/llm/coding/llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 2 Vulkan devices:
build: a731805 (9493)
Removing BLAS (OpenBLAS) does not decrease results.
llama-bench -fa 0,1 -ngl 666 -m /zfast/llm/coding/llama-2-7b.Q4_0.gguf
build: ad1b88c (9525)
Gentoo:
ggml_vulkan: Found 2 Vulkan devices:
build: 65ef50a (9501)
-
Hi,
./llama-bench -fa 0,1 -m ../../models/llama-2-7b.Q4_0.gguf
build: 8ed274e (9630)
also OpenVINO backend
OpenVINO: using device GPU
build: 8ed274e (9630)
SYCL bench results
-
AMD Radeon RX 580 2048SP (mining edition, 8GB), Windows, AMD proprietary driver
ggml_vulkan: 0 = AMD Radeon RX 580 2048SP (AMD proprietary driver) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | int dot: 0 | matrix cores: none
Noticeably lower than your RX 470 mining edition above (same 2048SP silicon). This is almost certainly down to the Windows proprietary driver versus RADV on Linux; I've seen close to 2x gaps between the two on this exact card on other workloads. I will follow up with a Linux/RADV run on the same hardware for a direct same-chip comparison.
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
Instructions
Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM. If you use RADV please run with the environment variable RADV_PERFTEST=nogttspill as that can fix a bunch of performance issues.
Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
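The single-GPU and RADV advice above can be wrapped in a small helper so that every run uses the same flags. This is only a sketch: `bench_command` is a hypothetical name, the `-ngl 99` and `-fa 0,1` defaults are copied from commands posted in this thread rather than from the official instructions, and `llama-bench` is assumed to be on `PATH`. The helper builds the command and environment without running anything.

```python
import os
import shlex


def bench_command(model_path, gpu_index=None, radv=False, flash_attn="0,1", ngl=99):
    """Build (argv, env) for a llama-bench run; nothing is executed here."""
    argv = ["llama-bench", "-m", model_path, "-ngl", str(ngl), "-fa", flash_attn]
    if gpu_index is not None:
        # Pin the benchmark to one GPU on multi-GPU systems.
        argv += ["-sm", "none", "-mg", str(gpu_index)]
    env = dict(os.environ)
    if radv:
        # Recommended in the instructions for the Mesa RADV driver.
        env["RADV_PERFTEST"] = "nogttspill"
    return argv, env


argv, env = bench_command("llama-2-7b.Q4_0.gguf", gpu_index=0, radv=True)
print("RADV_PERFTEST=" + env["RADV_PERFTEST"], shlex.join(argv))
```

To actually run the benchmark, pass the pair to `subprocess.run(argv, env=env, check=True)`.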
If multiple entries are posted for the same setup I'll prioritize newer commits with substantial Vulkan updates, otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc. even if the chip is the same. For integrated graphics note that the memory speed and number of channels will greatly affect your inference speed!
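To see why memory speed and channel count matter so much for integrated graphics, a rough back-of-the-envelope bound helps. The sketch below assumes that token generation is memory-bound and reads every weight once per token, that each memory channel is 64 bits wide, and that the Q4_0 model occupies about 3.56 GiB. All three are simplifying assumptions, and the DDR5-5600 dual-channel figures are a hypothetical example, not a measured system from this thread. Real tg numbers will be lower than this ceiling.

```python
GIB = 1024 ** 3


def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (assumes 64-bit channels)."""
    return channels * bus_bytes * mega_transfers_per_s * 1e6 / 1e9


def tg_upper_bound(bandwidth_gbs, model_gib):
    """Rough ceiling on tokens/s if each token streams the whole model once."""
    return bandwidth_gbs * 1e9 / (model_gib * GIB)


# Hypothetical example: dual-channel DDR5-5600, model size assumed to be 3.56 GiB.
bandwidth = peak_bandwidth_gbs(channels=2, mega_transfers_per_s=5600)
ceiling = tg_upper_bound(bandwidth, model_gib=3.56)
print(f"peak bandwidth: {bandwidth:.1f} GB/s, tg ceiling: {ceiling:.1f} t/s")
```

Halving the channel count halves the ceiling, which is why single-channel laptops tend to score far below dual-channel ones with the same chip.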
Vulkan Scoreboard (Click on the headings to expand the section)
Llama 2 7B, Q4_0, no FA
Llama 2 7B, Q4_0, FA enabled