Performance of llama.cpp with Vulkan #10879

netrunnereve · 2024-12-18T03:56:09Z

netrunnereve
Dec 18, 2024
Collaborator

This is similar to the Apple Silicon benchmark thread, but for Vulkan! We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it using the wget command in the instructions below.

Instructions

Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM. If you use RADV please run with the environment variable RADV_PERFTEST=nogttspill as that can fix a bunch of performance issues.

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release
make
./bin/llama-bench -m ../../llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 (add any extra options here)
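
For the multi-GPU case, the single-GPU flags and the RADV variable from the instructions can be combined into one command. This is a minimal sketch: `single_gpu_cmd` is a hypothetical helper, and the `MODEL` and `BENCH` defaults are assumptions based on the build steps above. It only prints the command so you can check it before running it.

```shell
#!/bin/sh
# Assumed paths, matching the build steps above; override via the environment.
MODEL="${MODEL:-../../llama-2-7b.Q4_0.gguf}"
BENCH="${BENCH:-./bin/llama-bench}"

# Hypothetical helper: print the llama-bench command for one GPU index.
single_gpu_cmd() {
    gpu="$1"
    # -sm none disables splitting across GPUs, -mg selects the GPU to use.
    # RADV_PERFTEST=nogttspill only matters when the RADV driver is in use.
    echo "RADV_PERFTEST=nogttspill $BENCH -m $MODEL -ngl 100 -fa 0,1 -sm none -mg $gpu"
}

# Print the command for GPU 0.
single_gpu_cmd 0
```

Run the printed line as is, or change the index to target another GPU.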

Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
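
A small sketch for collecting the git hash and Vulkan info string to post with your results. It assumes you saved the benchmark output to a file (called `bench.log` here, a made-up name) and that the device lines start with `ggml_vulkan: N =` as in the results posted in this thread; `vulkan_info` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print only the per-device lines of the Vulkan info string.
# Reads the named files, or standard input when no file is given.
vulkan_info() {
    grep -E '^ggml_vulkan: [0-9]+ = ' "$@"
}

# Typical use from the build directory (commented out so the sketch has no side effects):
#   ./bin/llama-bench -m ../../llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 2>&1 | tee bench.log
#   git rev-parse --short HEAD    # git hash of the checkout you built
#   vulkan_info bench.log         # Vulkan info string
```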

If multiple entries are posted for the same setup I'll prioritize newer commits with substantial Vulkan updates, otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc. even if the chip is the same. For integrated graphics note that the memory speed and number of channels will greatly affect your inference speed!
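
As a rough illustration of why memory speed and channel count matter: token generation has to read essentially all of the model weights for every token, so memory bandwidth divided by model size gives an optimistic ceiling on tg128. The sketch below assumes 64-bit (8-byte) memory channels and uses DDR5-5600 purely as an example configuration; `tg_ceiling` is a hypothetical helper, and real results will be lower.

```shell
#!/bin/sh
# Hypothetical helper: tg_ceiling CHANNELS MEGATRANSFERS_PER_SECOND MODEL_SIZE_GIB
# Prints an optimistic upper bound on tokens per second.
tg_ceiling() {
    LC_ALL=C awk -v ch="$1" -v mts="$2" -v gib="$3" 'BEGIN {
        bandwidth = ch * 8 * mts * 1000000      # bytes/s, assuming 8 bytes per transfer per channel
        model     = gib * 1024 * 1024 * 1024    # model size in bytes
        printf "%.1f\n", bandwidth / model
    }'
}

# 3.56 GiB is the size llama-bench reports for the Q4_0 model in this thread.
tg_ceiling 2 5600 3.56   # dual channel (example configuration)
tg_ceiling 1 5600 3.56   # single channel: half the ceiling
```

Halving the number of channels halves the bound, which is why two machines with the same integrated GPU can post very different tg128 numbers.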

Vulkan Scoreboard

Llama 2 7B, Q4_0, no FA

Chip	pp512 t/s	tg128 t/s	Commit	Comments
Nvidia RTX 5090	10381.64 ± 508.84	263.63 ± 0.91	`ca71fb9`	coopmat2
AMD Radeon RX 7900 XTX	3531.93 ± 31.74	191.28 ± 0.20	`2f0c2db`
Nvidia RTX 4090	9452.03 ± 187.70	187.97 ± 0.21	`4ae88d0`	coopmat2
Nvidia RTX 5080	7444.99 ± 20.11	185.10 ± 0.54	`f6b533d`	coopmat2
Nvidia RTX 3090	4666.15 ± 12.89	164.05 ± 0.89	`d05fe1d`	coopmat2
Nvidia A100	6389.86 ± 4.83	160.78 ± 0.16	`2257758`	coopmat2
Nvidia RTX 4080 Super	7101.18 ± 269.79	147.13 ± 5.64	`81086cd`	coopmat2
Nvidia RTX 3080	4287.11 ± 55.50	139.15 ± 0.05	`7c7d6ce`	coopmat2
Nvidia RTX A5000	3641.55 ± 9.05	139.89 ± 0.69	`4ae88d0`	coopmat2
AMD Radeon RX 9070 XT	5036.04 ± 88.16	137.11 ± 0.02	`e9fd8dc`
Nvidia RTX 5070 Ti	6213.63 ± 27.72	135.63 ± 0.18	`d13d0f6`	coopmat2
AMD Radeon AI Pro R9700	5281.22 ± 48.93	133.27 ± 0.33	`f9f3365`
Nvidia Tesla V100	1391.39 ± 1.19	129.58 ± 0.58	`7d77f07`
Nvidia RTX 4070 Ti Super	6099.18 ± 154.30	129.45 ± 0.18	`4ae88d0`	coopmat2
AMD Radeon RX 7900 XT	2941.58 ± 17.17	123.18 ± 0.40	`71e74a3`
AMD Radeon RX 9070	3164.10 ± 66.84	119.71 ± 3.40	`21c17b5`
AMD Radeon RX 7800 XT	2017.33 ± 19.30	118.27 ± 0.27	`4fdbc1e`
AMD Radeon RX 7900 GRE	2336.31 ± 7.52	116.11 ± 0.26	`4b2a477`
Apple M3 Ultra	1116.83 ± 0.55	115.54 ± 0.78	`2d451c8`	MoltenVK
Intel Arc Pro B70	3379.00 ± 47.92	112.02 ± 1.08	`b863507`
Nvidia A30	3290.70 ± 12.03	111.48 ± 0.18	`583cb83`
Nvidia RTX 4070 Ti	4981.44 ± 102.35	110.53 ± 0.00	`516a4ca`	coopmat2
Nvidia RTX 4070 Super	4608.20 ± 31.66	108.74 ± 0.18	`c945aaa`	coopmat2
AMD Radeon Instinct MI50	1119.55 ± 1.50	108.51 ± 0.81	`3af34b9`
AMD Radeon RX 6900 XT	1901.20 ± 36.70	108.00 ± 0.03	`a972fae`
AMD Radeon Pro VII	912.47 ± 1.06	106.03 ± 0.89	N/A
Nvidia Titan V	796.29 ± 5.84	105.06 ± 0.27	`e56abd2`
AMD Radeon VII	1059.14 ± 0.56	101.19 ± 0.53	`77d6ae4`
AMD Radeon RX 6800 XT	1752.92 ± 1.71	100.32 ± 0.97	N/A
Nvidia RTX 2080 Ti	1888.24 ± 9.20	97.58 ± 6.60	N/A
AMD Radeon RX 6800	1698.69 ± 0.80	95.61 ± 0.19	`4b385bf`
AMD Radeon Pro W6800X Duo	687.71 ± 4.33	94.82 ± 0.12	N/A
Nvidia RTX 5060 Ti	3460.92 ± 7.16	93.51 ± 0.15	`89f10ba`	coopmat2
Nvidia RTX 4070	3179.37 ± 46.16	92.29 ± 0.28	`9a48399`
Nvidia RTX 3060 Ti	2361.43 ± 4.32	86.74 ± 0.34	`b42c7fa`	coopmat2
AMD Radeon Pro W6800X	510.80 ± 0.13	86.47 ± 0.46	`13b4548`	MoltenVK
AMD Radeon RX 6700 XT	1051.20 ± 0.98	83.88 ± 0.08	`6d75883`
AMD Radeon RX 6750 XT	1040.58 ± 0.35	81.98 ± 0.03	`228f34c`
AMD Radeon Pro V620	1595.32 ± 1.59	81.78 ± 0.06	`03d4698`
Nvidia RTX 3070	2113.02 ± 7.38	78.71 ± 0.13	`1b8fb81`
AMD Radeon Instinct MI60	369.26 ± 2.48	78.16 ± 1.40	`504af20`
Nvidia RTX 3060	1815.70 ± 5.85	75.94 ± 0.80	`92c0b38`	coopmat2
Apple M4 Max	724.77 ± 20.93	75.02 ± 0.14	`1ece0cb6`
Nvidia Tesla T10	1692.70 ± 2.05	75.01 ± 0.21	`7f76692`	coopmat2
Nvidia RTX A4000	2248.14 ± 7.59	73.74 ± 0.08	`f5245b5`	coopmat2
AMD Radeon RX 5700 XT	529.69 ± 0.26	70.73 ± 0.04	`4fdbc1e`
AMD Radeon RX 9060 XT	2141.67 ± 6.87	70.54 ± 0.74	`ed52f36`
Intel Arc B580	620.94 ± 15.33	70.14 ± 0.28	`7f76692`
AMD Radeon Pro V540	583.88 ± 6.56	69.64 ± 0.24	`9da3dcd`
AMD Radeon Pro W5700	449.85 ± 0.46	68.55 ± 0.15	`23bc779`
Intel Arc Pro B60	522.36 ± 3.60	68.55 ± 0.01	`516a4ca`
Nvidia GTX 1080 Ti	540.69 ± 0.71	64.99 ± 0.08	`360d653`
Nvidia RTX 2070 Super	1199.13 ± 7.70	64.64 ± 0.20	`b7552cf`
Nvidia RTX 3070 Mobile	1689.40 ± 19.57	63.64 ± 0.39	`ceff6bb`	coopmat2
AMD Radeon RX Vega 64	575.06 ± 1.03	63.43 ± 0.07	`5dcb711`
Nvidia Tesla P100	678.14 ± 1.40	63.16 ± 0.06	`eec1e33`
AMD BC-250	370.66 ± 0.04	62.32 ± 0.32	`5886f4f`
AMD Radeon RX 6650 XT	1029.52 ± 1.21	62.14 ± 0.02	`dbb852b`
Nvidia RTX 4060 Mobile	2135.66 ± 23.18	59.53 ± 0.03	`a5c07dc`	coopmat2
Nvidia Tesla P40	488.06 ± 0.27	59.36 ± 0.16	N/A
Nvidia GTX 1660 Ti Mobile	511.67 ± 2.85	56.60 ± 0.07	`b43556e`
AMD Radeon Instinct MI25	439.42 ± 0.34	54.69 ± 0.03	`2739a71`
AMD Radeon RX 6600 XT	574.65 ± 0.86	53.92 ± 0.11	`091592d`
AMD Ryzen AI Max+ 395	1288.96 ± 6.49	53.59 ± 0.38	`7f76692`
AMD Radeon RX 7600 XT	840.85 ± 3.02	53.02 ± 0.01	`01d8eaa`
Intel Arc A770	1073.85 ± 29.68	52.56 ± 0.11	`a69d54f`
Nvidia GB10	2737.79 ± 19.56	52.28 ± 0.03	`b9da444`	coopmat2
AMD FirePro S9300 x2	247.26 ± 0.43	51.86 ± 0.11	`eec1e33`	Split across two GPUs
AMD Radeon RX 6600	761.89 ± 1.76	50.63 ± 0.02	`b1c70e2`
AMD Radeon RX Vega 56	439.87 ± 0.61	50.23 ± 0.14	`92c0b38`
Intel Arc B570	913.95 ± 0.90	49.64 ± 0.03	`7f76692`
Nvidia RTX 3060 Mobile	1059.76 ± 3.54	49.03 ± 0.13	`dbb3a47`
AMD Radeon RX 6800M	861.99 ± 7.67	48.71 ± 0.71	`8e6f8bc`
AMD Radeon RX 6600M	605.59 ± 0.65	48.21 ± 0.07	`fe5b78c`
Intel Arc A770M	875.92 ± 2.16	47.69 ± 0.16	`eeee367`
Intel Arc Pro B50	1186.45 ± 4.43	46.30 ± 0.07	`aa46bda`
Nvidia P104-100	311.90 ± 0.22	46.18 ± 0.05	`eec1e33`
Nvidia RTX A2000	1245.19 ± 8.76	45.52 ± 0.54	`b1afcab`	coopmat2
AMD Radeon RX 7600M XT	459.39 ± 2.34	45.28 ± 0.10	`b9ab0a4`	eGPU
AMD Radeon Pro V340	375.41 ± 0.24	45.16 ± 0.06	`9da3dcd`	Split across two GPUs
Nvidia GTX 1070 Ti	297.50 ± 0.54	42.86 ± 1.20	`860a9e4`	eGPU
Intel Arc A750	1075.94 ± 13.89	42.66 ± 0.18	`c1b1876`
Nvidia RTX 4050 Mobile	1154.28 ± 15.76	41.89 ± 0.10	`d79d8f3`
Nvidia GTX 1070	321.57 ± 0.93	41.48 ± 0.09	`eec1e33`
Nvidia Tesla M40	92.48 ± 0.02	39.35 ± 1.22	`b8372ee`
AMD Radeon RX 580	258.03 ± 0.71	39.32 ± 0.03	`de4c07f`
AMD Radeon RX 470	218.07 ± 0.56	38.63 ± 0.21	`e288693`
AMD Radeon RX 570	226.11 ± 0.45	37.44 ± 0.03	`e583f3b`
AMD Radeon Pro W5500	315.39 ± 3.76	36.82 ± 0.38	`860a9e4`
AMD Radeon RX 480	248.66 ± 0.28	34.71 ± 0.14	`3b15924`
Apple M2 Ultra	205.98 ± 0.02	34.34 ± 0.12	`dbb852b`	Asahi Linux
Nvidia GTX 980	186.24 ± 0.09	33.90 ± 0.51	`860a9e4`
Nvidia P106-100	183.78 ± 0.26	29.77 ± 0.04	`23bc779`
AMD FirePro W8100	155.22 ± 0.17	29.52 ± 0.05	`4536363`
Nvidia Tesla P4	265.54 ± 0.21	28.03 ± 0.14	`24d2ee0`
AMD Radeon RX 6500 XT	255.25 ± 0.35	27.81 ± 0.10	`g9fdfcd`
Apple M3	263.70 ± 0.02	26.39 ± 0.14	`b9ab0a4`	MoltenVK
AMD FirePro S10000	94.78 ± 0.02	25.32 ± 0.02	`914a82d`	Split across two GPUs
Nvidia Quadro P2000	169.55 ± 0.17	23.05 ± 0.03	`63f8fe0`
Intel Core Ultra 200 Series	544.95 ± 4.15	22.49 ± 0.09	`cea560f`
AMD Ryzen AI 9 300 Series	479.07 ± 0.41	22.41 ± 0.18	N/A
AMD Ryzen 6000 Series	240.89 ± 0.52	21.26 ± 0.08	`ee09828`
Apple M2 Pro	62.70 ± 0.03	20.95 ± 0.11	`1fe0029`	Asahi Linux
Nvidia GTX 1050 Ti	136.42 ± 0.67	20.96 ± 0.21	`2f0c2db`
AMD Ryzen 7000 Series	281.62 ± 1.56	19.91 ± 0.07	`ebce03e`
AMD Ryzen 8000 Series	267.27 ± 7.61	19.81 ± 0.08	`1e9d771`
AMD Ryzen Z1 Extreme	199.36 ± 7.02	18.77 ± 0.02	`53ff6b9`
AMD FirePro D700	69.95 ± 0.04	16.62 ± 0.01	`d3bd719`	MoltenVK, running in FP16 mode on FP32 only chip
AMD Radeon Pro WX 4100	78.79 ± 0.10	16.05 ± 0.07	`860a9e4`
Apple M2	50.79 ± 0.16	13.50 ± 0.02	`8c0d6bb`	Asahi Linux
Apple M1	38.29 ± 0.00	12.47 ± 0.03	`2370665`	Asahi Linux
AMD Ryzen 5000 Series	90.55 ± 0.08	10.98 ± 0.07	`d84635b`
Intel Core 1100 Series	187.20 ± 1.78	10.39 ± 0.04	`abb9f3c`
AMD Radeon RX 550	52.66 ± 0.49	10.20 ± 0.01	N/A
AMD Ryzen 4000 Series	103.87 ± 0.02	9.63 ± 0.01	`4b385bf`
Nvidia Tesla K80	89.46 ± 0.10	9.39 ± 0.06	`5d46bab`	Running on single GPU
Nvidia Tesla K40	64.37 ± 0.09	9.30 ± 0.19	`eec1e33`
MediaTek Dimensity 9400	38.36 ± 15.15	8.92 ± 0.06	`b9ab0a4`	GPU supports coopmat but pp512 is faster with it turned off
Intel Core Ultra 100 Series	185.51 ± 0.22	8.21 ± 0.07	`1d72c84`
AMD Ryzen 3000 Series	48.63 ± 0.10	8.49 ± 0.01	`1fe0029`
Intel Core 1000 Series	25.58 ± 0.00	4.25 ± 0.18	N/A
Intel Core 8000 Series	25.43 ± 0.17	3.35 ± 0.03	`c4df49a`
Intel N150	28.84 ± 0.02	2.93 ± 0.00	`4f63cd7`
CIX CD8180	2.80 ± 0.01	5.51 ± 0.00	`4dca015`

Llama 2 7B, Q4_0, FA enabled

Chip	pp512 t/s	tg128 t/s	Commit	Comments
Nvidia RTX 5090	11796.38 ± 601.36	273.68 ± 0.52	`ca71fb9`	coopmat2
AMD Radeon RX 7900 XTX	3332.90 ± 11.47	195.30 ± 0.23	`2f0c2db`
Nvidia RTX 5080	8054.59 ± 35.68	192.17 ± 0.21	`f6b533d`	coopmat2
Nvidia RTX 4090	10830.41 ± 36.25	190.10 ± 0.31	`4ae88d0`	coopmat2
Nvidia RTX 5070 Ti	7567.94 ± 41.96	176.18 ± 1.12	`0d0764d`	coopmat2
Nvidia RTX 3090	5200.68 ± 16.44	171.61 ± 2.69	`d05fe1d`	coopmat2
Nvidia A100	7064.40 ± 1.63	170.56 ± 0.02	`2257758`	coopmat2
Nvidia RTX 4080 Super	8007.37 ± 46.03	150.20 ± 0.26	`81086cd`	coopmat2
Nvidia RTX 3080	4913.83 ± 21.52	145.74 ± 0.16	`7c7d6ce`	coopmat2
Nvidia Tesla V100	1411.25 ± 2.12	142.13 ± 0.03	`7d77f07`
Nvidia RTX A5000	4071.22 ± 13.13	140.43 ± 0.22	`4ae88d0`	coopmat2
Nvidia RTX 4070 Ti Super	6801.18 ± 40.12	135.81 ± 4.29	`4ae88d0`	coopmat2
AMD Radeon AI Pro R9700	5620.02 ± 60.27	138.79 ± 0.10	`f9f3365`
AMD Radeon RX 9070 XT	5048.07 ± 2.12	131.54 ± 0.05	`e9fd8dc`
AMD Radeon RX 7800 XT	2197.05 ± 6.03	124.86 ± 0.10	`4fdbc1e`
AMD Radeon RX 7900 XT	2701.13 ± 8.75	120.62 ± 0.36	`71e74a3`
AMD Radeon RX 9070	2859.98 ± 31.53	119.51 ± 0.13	`21c17b5`
AMD Radeon Instinct MI50	1127.37 ± 0.48	117.94 ± 0.02	`3af34b9`
Nvidia A30	3158.25 ± 22.59	117.43 ± 0.23	`583cb83`
Intel Arc Pro B70	3150.55 ± 9.06	114.19 ± 1.07	`b863507`
Nvidia RTX 4070 Ti	5515.26 ± 7.92	113.52 ± 0.00	`516a4ca`	coopmat2
Nvidia RTX 4070 Super	5198.91 ± 29.02	112.00 ± 0.17	`c945aaa`	coopmat2
AMD Radeon VII	1007.19 ± 6.55	110.86 ± 0.34	`77d6ae4`
AMD Radeon RX 7900 GRE	2251.37 ± 2.91	110.50 ± 0.23	`4b2a477`
Nvidia Titan V	792.74 ± 4.30	109.21 ± 0.72	`e56abd2`
AMD Radeon Pro VII	783.94 ± 0.77	108.45 ± 0.48	N/A
AMD Radeon RX 6900 XT	1761.93 ± 4.75	106.15 ± 0.04	`a972fae`
Nvidia RTX 2080 Ti	1936.25 ± 32.08	100.99 ± 0.24	N/A
AMD Radeon RX 6800 XT	1704.79 ± 0.71	100.50 ± 0.06	N/A
AMD Radeon Pro W6800X Duo	795.28 ± 0.72	100.08 ± 0.02	N/A
Nvidia RTX 5060 Ti	3912.65 ± 5.86	97.01 ± 0.14	`89f10ba`	coopmat2
AMD Radeon RX 6800	1749.46 ± 3.36	96.65 ± 0.48	`4b385bf`
Nvidia RTX 4070	4293.57 ± 27.70	91.49 ± 0.89	`9a48399`	coopmat2
Nvidia RTX 3060 Ti	2644.95 ± 26.82	90.35 ± 1.00	`b42c7fa`	coopmat2
AMD Radeon RX 6750 XT	997.05 ± 0.45	82.29 ± 0.06	`228f34c`
AMD Radeon RX 6700 XT	1010.90 ± 12.89	81.86 ± 0.19	`6d75883`
Nvidia RTX 3060	2012.88 ± 10.12	80.59 ± 0.02	`92c0b38`	coopmat2
AMD Radeon Pro V620	1556.31 ± 2.82	79.24 ± 0.09	`03d4698`
Nvidia RTX A4000	2482.74 ± 26.05	76.07 ± 0.08	`f5245b5`	coopmat2
Nvidia Tesla T10	1840.14 ± 1.22	76.05 ± 0.13	`7f76692`	coopmat2
AMD Radeon RX 5700 XT	538.31 ± 0.35	74.43 ± 0.03	`4fdbc1e`
Intel Arc B580	419.49 ± 3.37	72.00 ± 0.24	`7f76692`
Apple M4 Max	557.46 ± 26.87	71.79 ± 4.16	`1ece0cb6`
AMD Radeon Pro W5700	446.98 ± 0.39	71.30 ± 0.24	`23bc779`
Intel Arc Pro B60	274.76 ± 0.27	70.54 ± 0.03	`516a4ca`
AMD Radeon RX 9060 XT	1915.41 ± 7.90	70.52 ± 0.16	`ed52f36`
AMD Radeon RX Vega 64	584.98 ± 1.12	67.70 ± 0.09	`5dcb711`
Nvidia Tesla P100	685.51 ± 0.88	66.48 ± 0.02	`eec1e33`
Nvidia GTX 1080 Ti	529.96 ± 0.38	64.63 ± 0.10	`360d653`
AMD BC-250	356.87 ± 1.24	63.14 ± 0.09	`5886f4f`
Nvidia RTX 3070 Mobile	1832.07 ± 57.14	62.92 ± 0.37	`ceff6bb`	coopmat2
AMD Radeon RX 6650 XT	1088.90 ± 0.40	64.53 ± 0.75	`dbb852b`
Nvidia RTX 4060 Mobile	2358.03 ± 12.17	60.01 ± 0.08	`a5c07dc`	coopmat2
Nvidia Tesla P40	484.37 ± 0.27	59.22 ± 0.15	N/A
Nvidia GTX 1660 Ti Mobile	514.34 ± 0.88	57.30 ± 0.42	`b43556e`
AMD Radeon RX 7600 XT	1024.38 ± 7.56	56.11 ± 0.02	`01d8eaa`
AMD FirePro S9300 x2	243.33 ± 0.22	55.64 ± 0.06	`eec1e33`	Split across two GPUs
Nvidia GB10	3279.89 ± 26.78	53.64 ± 0.05	`b9da444`	coopmat2
AMD Radeon RX 6600	808.76 ± 0.15	53.24 ± 0.03	`b1c70e2`
Intel Arc A770	1119.68 ± 30.25	53.07 ± 0.09	`a69d54f`
AMD Ryzen AI Max+ 395	1357.07 ± 10.94	53.00 ± 0.13	`7f76692`
AMD Radeon RX Vega 56	428.54 ± 0.50	52.66 ± 0.03	`92c0b38`
Intel Arc B570	288.51 ± 0.09	50.49 ± 0.05	`7f76692`
AMD Radeon RX 6800M	784.16 ± 2.76	49.06 ± 0.34	`8e6f8bc`
Nvidia RTX A2000	1361.85 ± 3.26	45.69 ± 0.20	`b1afcab`	coopmat2
Intel Arc A770M	384.74 ± 0.78	45.68 ± 0.06	`eeee367`
Nvidia P104-100	325.30 ± 0.25	48.64 ± 0.04	`eec1e33`
Intel Arc Pro B50	1122.53 ± 3.73	47.69 ± 0.06	`aa46bda`
AMD Radeon Pro V340	360.23 ± 0.74	47.54 ± 0.06	`9da3dcd`	Split across two GPUs
Intel Arc A750	303.37 ± 1.44	43.96 ± 0.03	`c1b1876`
Nvidia GTX 1070 Ti	292.85 ± 0.23	43.42 ± 0.34	`860a9e4`	eGPU
Nvidia GTX 1070	330.84 ± 1.02	43.33 ± 0.06	`360d653`
Nvidia Tesla M40	93.35 ± 0.01	41.68 ± 0.01	`b8372ee`
AMD Radeon RX 570	229.23 ± 0.45	40.18 ± 0.00	`e583f3b`
AMD Radeon RX 470	197.26 ± 0.27	37.28 ± 0.11	`3769fe6`
AMD Radeon RX 480	194.52 ± 0.61	37.23 ± 0.09	`0bcb40b`
Apple M2 Ultra	198.83 ± 0.85	198.83 ± 0.85	`dbb852b`	Asahi Linux
Nvidia GTX 980	180.97 ± 0.74	34.16 ± 0.10	`860a9e4`
Nvidia P106-100	183.40 ± 0.34	30.79 ± 0.32	`23bc779`
AMD FirePro W8100	140.52 ± 0.34	29.28 ± 0.14	`4536363`
Nvidia Tesla P4	287.14 ± 0.29	28.37 ± 0.24	`24d2ee0`
Nvidia Quadro P2000	181.71 ± 0.12	23.77 ± 0.02	`63f8fe0`
Intel Core Ultra 200 Series	536.48 ± 1.27	23.05 ± 0.04	`cea560f`
AMD Ryzen AI 9 300 Series	532.59 ± 3.55	22.31 ± 0.06	N/A
AMD Ryzen 8000 Series	311.02 ± 0.12	21.28 ± 0.01	`1e9d771`
AMD Ryzen 6000 Series	277.91 ± 0.37	21.15 ± 0.09	`ee09828`
Apple M2 Pro	58.86 ± 0.02	20.97 ± 0.03	`1fe0029`	Asahi Linux
AMD Ryzen 7000 Series	312.85 ± 2.51	20.09 ± 0.35	`835b2b9`
Nvidia GTX 1050 Ti	127.54 ± 1.03	20.08 ± 0.17	`2f0c2db`
AMD Radeon Pro WX 4100	75.59 ± 0.19	16.56 ± 0.04	`860a9e4`
Apple M1	35.93 ± 0.00	12.85 ± 0.02	`2370665`	Asahi Linux
Apple M2	46.81 ± 0.08	12.25 ± 2.30	`8c0d6bb`	Asahi Linux
AMD Ryzen 5000 Series	79.06 ± 0.01	10.75 ± 0.00	`5d195f1`
Intel Core 1100 Series	174.77 ± 4.47	10.58 ± 0.03	`abb9f3c`
Nvidia Tesla K40	64.37 ± 0.02	9.92 ± 0.06	`eec1e33`
AMD Ryzen 4000 Series	113.32 ± 0.01	9.87 ± 0.01	`4b385bf`
Nvidia Tesla K80	88.26 ± 0.19	9.49 ± 0.01	`5d46bab`	Running on single GPU
AMD Ryzen 3000 Series	47.41 ± 0.14	8.47 ± 0.01	`1fe0029`
Intel Core Ultra 100 Series	77.66 ± 2.75	7.75 ± 0.05	`2e89f76`
Intel Core 8000 Series	25.55 ± 0.04	3.35 ± 0.02	`c4df49a`
Intel N150	25.59 ± 0.00	2.91 ± 0.00	`4f63cd7`

netrunnereve · 2024-12-18T03:58:41Z

netrunnereve
Dec 18, 2024
Collaborator Author

AMD FirePro W8100

ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
build: 4da69d1a (4351)

model	size	params	backend	ngl	threads	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	pp512	137.10 ± 0.44
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	tg128	28.51 ± 0.12

2 replies

netrunnereve May 1, 2025
Collaborator Author

With the latest updates:

ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: d7a14c42 (5252)

model	size	params	backend	ngl	threads	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	pp512	154.96 ± 0.60
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	none	tg128	28.55 ± 0.17

netrunnereve Aug 22, 2025
Collaborator Author

With FA:

ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 45363632c (6249)

model	size	params	backend	ngl	threads	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	none	0	pp512	155.22 ± 0.17
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	none	0	tg128	29.52 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	none	1	pp512	140.52 ± 0.34
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	none	1	tg128	29.28 ± 0.14

netrunnereve · 2024-12-18T04:00:36Z

netrunnereve
Dec 18, 2024
Collaborator Author

AMD RX 470

ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
build: 4da69d1a (4351)

model	size	params	backend	ngl	threads	main_gpu	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	pp512	161.47 ± 0.43
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	tg128	33.45 ± 0.04

5 replies

netrunnereve May 1, 2025
Collaborator Author

With the latest updates:

ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: d7a14c42 (5252)

model	size	params	backend	ngl	threads	main_gpu	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	pp512	185.48 ± 1.17
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	tg128	33.94 ± 0.06

netrunnereve Aug 22, 2025
Collaborator Author

With FA:

ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 45363632c (6249)

model	size	params	backend	ngl	threads	main_gpu	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	1	none	0	pp512	185.73 ± 0.69
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	1	none	0	tg128	34.89 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	1	none	1	pp512	179.01 ± 0.65
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	1	none	1	tg128	34.65 ± 0.17

pebaryan Aug 23, 2025

I've got the 8 GB mining edition, which ran slightly better:

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	0	pp512	218.07 ± 0.56
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	0	tg128	38.63 ± 0.21
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	1	pp512	197.94 ± 2.94
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	1	tg128	35.14 ± 1.93

build: e288693 (6242)

netrunnereve Feb 25, 2026
Collaborator Author

Here's a rerun with the recent flash attention updates:

ggml_vulkan: 0 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 3769fe6eb (8156)

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	197.19 ± 0.63
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	34.42 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	197.26 ± 0.27
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	37.28 ± 0.11

0cc4m Feb 25, 2026
Collaborator

Nice, it seems to help even on Polaris.

max-krasnyansky · 2024-12-18T05:09:04Z

max-krasnyansky
Dec 18, 2024
Collaborator

Ubuntu 24.04, with Vulkan and CUDA installed from the official APT packages.

ggml_vulkan: 0 = NVIDIA GeForce RTX 3080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	1706.07 ± 139.33
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	62.16 ± 1.98

build: 4da69d1 (4351)

vs CUDA on the same build/setup

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	pp512	4499.47 ± 60.66
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	tg128	131.01 ± 0.43

build: 4da69d1 (4351)

0 replies

hkbu-kennycheng · 2025-01-08T02:57:11Z

hkbu-kennycheng
Jan 8, 2025

Macbook Air M2 on Asahi Linux

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M2 (G14G B0) (Honeykrisp) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	38.67 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	11.07 ± 0.04

build: 017cc5f

3 replies

ericcurtin Jan 14, 2025
Collaborator

For the record, I think the slowness here is on the Honeykrisp side rather than in llama.cpp.

tsugabloom Mar 29, 2025

Can you share how you got vulkan to build on Asahi? I can't seem to get cmake to notice it.

cmake -B build -DGGML_CPU_AARCH64=OFF -DGGML_VULKAN=1
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- Including CPU backend
-- ARM detected
-- ARM -mcpu not found, -mcpu=native will be used
-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+nosve+nosme 
CMake Error at /usr/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find Vulkan (missing: Vulkan_LIBRARY) (found version "1.3.296")
Call Stack (most recent call first):
  /usr/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.30/Modules/FindVulkan.cmake:595 (find_package_handle_standard_args)
  ggml/src/ggml-vulkan/CMakeLists.txt:4 (find_package)


-- Configuring incomplete, errors occurred!

tsugabloom Mar 29, 2025

Spoke too soon, got it working! cmake -B build -DGGML_CPU_AARCH64=OFF -DGGML_VULKAN=1 -DVulkan_LIBRARY=/usr/lib64/libvulkan.so.1
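
The fix above should carry over to other distributions where the Vulkan loader lives in a different directory. A minimal sketch, assuming the library is named `libvulkan.so` or `libvulkan.so.1`; `find_vulkan_lib` and the directory list in the usage comment are hypothetical, so adjust them for your system.

```shell
#!/bin/sh
# Hypothetical helper: print the first Vulkan loader library found in the given directories.
# Returns non-zero if none of the directories contains one.
find_vulkan_lib() {
    for dir in "$@"; do
        for lib in "$dir/libvulkan.so" "$dir/libvulkan.so.1"; do
            if [ -e "$lib" ]; then
                echo "$lib"
                return 0
            fi
        done
    done
    return 1
}

# Usage (commented out; the directory list is an assumption):
#   lib=$(find_vulkan_lib /usr/lib64 /usr/lib /usr/lib/aarch64-linux-gnu) &&
#       cmake -B build -DGGML_CPU_AARCH64=OFF -DGGML_VULKAN=1 -DVulkan_LIBRARY="$lib"
```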

hkbu-kennycheng · 2025-01-08T03:22:16Z

hkbu-kennycheng
Jan 8, 2025

Gentoo Linux on ROG Ally (2023) Ryzen Z1 Extreme

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	199.36 ± 7.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	18.77 ± 0.02

build: 53ff6b9

0 replies

hkbu-kennycheng · 2025-01-08T10:35:31Z

hkbu-kennycheng
Jan 8, 2025

ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1545.39 ± 6.58
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	88.12 ± 1.06

build: 53ff6b9

4 replies

0cc4m Jan 8, 2025
Collaborator

Cool setup! Could you also post the result of 1, 2 and 3 7900 XTX GPUs? You can use only the first GPU with export GGML_VK_VISIBLE_DEVICES=0, the first two with export GGML_VK_VISIBLE_DEVICES=0,1 and so on.
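
The growing device subsets described above can be generated with a small loop instead of typing each one. A sketch, where `device_sets` is a hypothetical helper and the `llama-bench` path and model name in the usage comment are assumptions:

```shell
#!/bin/sh
# Hypothetical helper: for N GPUs, print "0", "0,1", ... up to "0,1,...,N-1", one per line.
device_sets() {
    n="$1"
    list=""
    i=0
    while [ "$i" -lt "$n" ]; do
        if [ -z "$list" ]; then
            list="$i"
        else
            list="$list,$i"
        fi
        echo "$list"
        i=$((i + 1))
    done
}

# Usage (commented out; needs a built llama-bench and the model file):
#   for devs in $(device_sets 4); do
#       env GGML_VK_VISIBLE_DEVICES="$devs" ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
#   done

# Show the subsets for a four-GPU machine.
device_sets 4
```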

hkbu-kennycheng Jan 8, 2025

env GGML_VK_VISIBLE_DEVICES=0 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2022.59 ± 10.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	136.24 ± 0.30

env GGML_VK_VISIBLE_DEVICES=1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2039.24 ± 18.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	140.68 ± 2.09

env GGML_VK_VISIBLE_DEVICES=2 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	2062.17 ± 5.36
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	143.99 ± 0.23

env GGML_VK_VISIBLE_DEVICES=3 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1997.04 ± 5.78
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	136.98 ± 1.73

env GGML_VK_VISIBLE_DEVICES=0,1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1668.19 ± 12.78
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	100.62 ± 0.66

env GGML_VK_VISIBLE_DEVICES=0,1,2 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1566.38 ± 8.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	97.96 ± 1.13

env GGML_VK_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100
ggml_vulkan: Found 4 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 2 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
ggml_vulkan: 3 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	1484.04 ± 6.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	91.48 ± 0.63
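
To put the multi-GPU numbers above in perspective, each result can be expressed as a percentage of the single-GPU run on device 0. A small sketch using the figures posted above; `relative` is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical helper: relative MULTI_GPU_TPS SINGLE_GPU_TPS
# Prints the multi-GPU throughput as a percentage of the single-GPU throughput.
relative() {
    LC_ALL=C awk -v multi="$1" -v single="$2" 'BEGIN { printf "%.1f\n", 100 * multi / single }'
}

relative 1484.04 2022.59   # pp512: all four GPUs vs device 0 alone
relative 91.48 136.24      # tg128: all four GPUs vs device 0 alone
```

With these numbers that works out to roughly 73% for pp512 and 67% for tg128, so for a model this small, splitting across GPUs costs throughput rather than adding it.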

netrunnereve Jan 8, 2025
Collaborator Author

For this multi-GPU case, getting Vulkan to support pipeline parallelism (#6017) might help improve the prompt processing speed.

hkbu-kennycheng Jan 9, 2025

@netrunnereve I updated the commit ID in all my results.

0cc4m · 2025-01-08T11:04:08Z

0cc4m
Jan 8, 2025
Collaborator

build: 0d52a69 (4439)

NVIDIA GeForce RTX 3090 (NVIDIA)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	3301.47 ± 33.76
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	123.72 ± 0.14

AMD Radeon RX 6800 XT (RADV NAVI21) (radv)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	863.03 ± 0.70
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	91.59 ± 0.40

AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	312.02 ± 0.97
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	70.17 ± 0.25

Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	95.52 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	44.49 ± 0.03

0 replies

0cc4m · 2025-01-08T11:08:46Z

0cc4m
Jan 8, 2025
Collaborator

@netrunnereve Some of the tg results here are a little low, I think they might be debug builds. The cmake step (at least on Linux) might require cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release

2 replies

netrunnereve Jan 8, 2025
Collaborator Author

I've added -DCMAKE_BUILD_TYPE=Release to the post, but honestly I've always built without this flag for both Vulkan and CPU backends and never noticed a difference in performance. Having Release set might strip the debug symbols but it shouldn't affect the compiler optimizations.

My release numbers for the RX 470 are basically identical to the ones I posted earlier without the flag.

model	size	params	backend	ngl	threads	main_gpu	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	pp512	160.08 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	none	tg128	33.41 ± 0.15

0cc4m Jan 8, 2025
Collaborator

Maybe not in your case, but some other results are suspiciously low in tg (for example the RTX 3080)

qnixsynapse · 2025-01-09T02:41:52Z

qnixsynapse
Jan 9, 2025
Collaborator

Build: 8d59d91 (4450)
ggml_vulkan: 0 = Intel(R) Arc(tm) A750 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	88.86 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	27.57 ± 0.03

Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
Compared to SYCL:

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	pp512	1616.11 ± 5.28
llama 7B Q4_0	3.56 GiB	6.74 B	SYCL	99	tg128	36.64 ± 0.05

edit: retested both with the default batch size.

8 replies

0cc4m Jan 10, 2025
Collaborator

> They do have vtune but it needs a third party kernel module to run which I don't like tbh.
>
> Also, I don't know whether it supports Vulkan apps or not. But it does seem to support opencl.

I put my A770 into a Windows PC and gave Intel GPA and vtune a shot: GPA just crashes most of the time, I couldn't get it to trace anything useful. vtune works, but does not support Vulkan. It just shows some high-level metrics in that case, not really useful sadly.

qnixsynapse Jan 11, 2025
Collaborator

> Your Vulkan tg result is lower than expected, can you retry with the cmake build type set like in the updated instructions? It might be due to a debug build.

I did build it with cmake with build type Release.

0cc4m Jan 11, 2025
Collaborator

In that case it's something else, cause it should be performing similarly to my A770. I suspect the mesa version, there was something in newer mesa versions that slowed down tg on Intel.

qnixsynapse Jan 11, 2025
Collaborator

A750 has 448 CUs, A770 has 512 CUs I think. Personally, I am not worried about tg. I am worried about pp here. The gemm batch quickly saturates my GPU.

qnixsynapse Feb 9, 2025
Collaborator

@0cc4m https://gitlab.freedesktop.org/mesa/mesa/-/issues/12585

0cc4m · 2025-01-09T15:32:01Z

0cc4m
Jan 9, 2025
Collaborator

Here's something exotic: An AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.

build: 914a82d (4452)

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD FirePro W8000 (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD FirePro W8000 (RADV TAHITI) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	pp512	94.78 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	8	tg128	25.32 ± 0.02

1 reply

netrunnereve Jan 9, 2025
Collaborator Author

Very interesting, and looks like it's pretty close to the W8100 in tg despite being a dual GPU card. Your backend scales pretty well with layer splitting which is why I find it worthwhile to run my RX470 and W8100 together (I end up getting results that are close to the average of both cards).

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon FirePro W8100 (RADV HAWAII) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	threads	main_gpu	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	pp512	147.84 ± 0.38
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	1	tg128	30.77 ± 0.00

vkhodygo · 2025-01-10T12:21:36Z

vkhodygo
Jan 10, 2025

Latest Arch with Vulkan Instance Version: 1.4.303 on an i7-1185G7 laptop. The config is not completely stock: I had to deal with thermals ages ago to boost performance, so it doesn't throttle.

For the sake of consistency I run every bit in a script and also build every target from scratch (for some reason cmake doesn't want to clean everything):

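# Suspend every other process this user may signal, to reduce background noise
kill -STOP -1

# Run the benchmark, aborting it if it takes longer than 240 seconds
timeout 240s $COMMAND

# Resume the suspended processes
kill -CONT -1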

Vulkan only:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	42.02 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	7.28 ± 0.24

build: ff3fcab (4459)

Vulkan and OpenBLAS w/ default 4 threads:

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	pp512	42.05 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	tg128	7.35 ± 0.26

This configuration seems to underutilise both the GPU and the CPU in practice, judging by the activity shown in top.

Vulkan and OpenBLAS w/ default 8 threads:

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	pp512	41.89 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	tg128	7.22 ± 0.20

4 replies

0cc4m Jan 10, 2025
Collaborator

Unless you reduce the number of GPU layers, the thread count and OpenBLAS vs. non-OpenBLAS are not going to make any difference. Try it with -ngl 0: then only prompt processing is accelerated using Vulkan and the rest runs on the CPU. This is often a good setting for integrated GPUs.

vkhodygo Jan 10, 2025

That's something I didn't think about, with -ngl 0 it goes like this:

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	pp512	30.51 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	4	tg128	9.87 ± 0.05

build: ba8a1f9 (4460)

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	pp512	32.11 ± 0.45
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	8	tg128	9.49 ± 0.18
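
To put numbers on that trade-off, here is a minimal sketch comparing the two 4-thread Vulkan,BLAS runs above. It assumes the earlier run offloaded all layers (its table has no ngl column), and `relative_change` is just a hypothetical helper name, not anything from llama.cpp. The two runs also come from different builds (ff3fcab and ba8a1f9), so treat the result as a rough indication only.

```python
# t/s means copied from the 4-thread Vulkan,BLAS tables above
full_offload = {"pp512": 42.05, "tg128": 7.35}  # assumed: all layers on the GPU
cpu_layers = {"pp512": 30.51, "tg128": 9.87}    # -ngl 0


def relative_change(old: float, new: float) -> float:
    """Percentage change when going from old to new."""
    return (new - old) / old * 100.0


changes = {
    test: relative_change(full_offload[test], cpu_layers[test])
    for test in full_offload
}
for test, pct in changes.items():
    print(f"{test}: {pct:+.1f}%")
```

On these figures, -ngl 0 costs roughly 27% in prompt processing and gains roughly 34% in token generation on this iGPU.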

vkhodygo Feb 5, 2025

It seems the latest patches have improved the results a bit:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none

model	size	params	backend	ngl	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	50.86 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	8.30 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	2	pp512	50.90 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	2	tg128	8.11 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	4	pp512	50.91 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	4	tg128	7.99 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	pp512	50.89 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	8	tg128	7.92 ± 0.24
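
As a quick sanity check that the thread count barely matters with full offload, the sketch below computes the spread of the means in the table above. `spread_percent` is a hypothetical helper, and the run-to-run deviations (the ± values) are ignored for simplicity.

```python
from statistics import mean

# t/s means copied from the table above, keyed by thread count
pp512 = {1: 50.86, 2: 50.90, 4: 50.91, 8: 50.89}
tg128 = {1: 8.30, 2: 8.11, 4: 7.99, 8: 7.92}


def spread_percent(results: dict[int, float]) -> float:
    """Range (max - min) expressed as a percentage of the mean."""
    values = list(results.values())
    return (max(values) - min(values)) / mean(values) * 100.0


pp_spread = spread_percent(pp512)
tg_spread = spread_percent(tg128)
print(f"pp512 spread: {pp_spread:.2f}%")
print(f"tg128 spread: {tg_spread:.2f}%")
```

The pp512 spread is about a tenth of a percent. The tg128 spread is under 5%, which is in the same ballpark as the ± values reported for those rows.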

macie Jun 1, 2025

A few months later and I get:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	106.19 ± 0.40
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	5.89 ± 0.20
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	0	pp512	73.26 ± 1.55
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	0	tg128	5.24 ± 0.02

build: f3a4b16 (5568)

I run it on Linux (Arch with the llama.cpp-vulkan-git package compiled by GCC 15). From my tests, only the Vulkan backend (1.4.313) provides visible gains on the i7-1185G7 processor compared to other methods (I tried different combinations of GCC and Intel DPC++ compilers and backends: BLIS, OpenBLAS, oneMKL, SYCL, Vulkan).

I'm curious why I cannot go over 6 t/s. Is this an issue with the newer llama.cpp version or with my OS configuration?

0cc4m · 2025-01-10T20:27:15Z

0cc4m
Jan 10, 2025
Collaborator

Intel ARC A770 on Windows:

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp512	314.24 ± 1.04
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg128	45.22 ± 0.25

build: ba8a1f9 (4460)

0 replies

8XXD8 · 2025-01-11T12:48:55Z

8XXD8
Jan 11, 2025

Single GPU Vulkan

Radeon Instinct MI25

ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	439.42 ± 0.34
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	54.69 ± 0.03

build: 2739a71 (4461)

Radeon PRO VII

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	329.86 ± 0.80
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	75.22 ± 0.05

build: 2739a71 (4461)

Multi GPU Vulkan

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	324.55 ± 0.55
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	38.39 ± 0.09

build: 2739a71 (4461)

ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 3 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
ggml_vulkan: 4 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	Vulkan	100	pp512	32.29 ± 0.04
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	Vulkan	100	tg128	4.75 ± 0.00

build: 2739a71 (4461)

Single GPU ROCm

Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	409.83 ± 0.23
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	63.94 ± 0.06

build: 2739a71 (4461)

Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	1064.99 ± 1.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	87.45 ± 0.04

build: 2739a71 (4461)

Multi GPU ROCm

Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	1061.87 ± 0.26
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	81.49 ± 0.41

build: 2739a71 (4461)

Layer split
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 3: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
Device 4: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

model	size	params	backend	ngl	test	t/s
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	pp512	16.36 ± 0.02
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	tg128	6.43 ± 0.01

build: 2739a71 (4461)

Row split
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 1: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 2: AMD Radeon Pro VII, compute capability 9.0, VMM: no
Device 3: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
Device 4: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no

model	size	params	backend	ngl	sm	test	t/s
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	row	pp512	30.86 ± 0.03
llama 70B Q5_K - Medium	46.51 GiB	70.55 B	ROCm	100	row	tg128	12.52 ± 0.21

build: 2739a71 (4461)

Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
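
For the 70B run on five GPUs, the gap can be worked out from the tables above. This is only a sketch: the labels are mine, and it assumes the Vulkan run used the default split mode since no sm column is shown for it.

```python
# llama 70B Q5_K - Medium on five GPUs, t/s means copied from the tables above
results = {
    "Vulkan (default split)": {"pp512": 32.29, "tg128": 4.75},
    "ROCm (layer split)": {"pp512": 16.36, "tg128": 6.43},
    "ROCm (row split)": {"pp512": 30.86, "tg128": 12.52},
}

# Express every configuration relative to the Vulkan run
baseline = results["Vulkan (default split)"]
ratios = {
    name: {test: value / baseline[test] for test, value in tests.items()}
    for name, tests in results.items()
}
for name, tests in ratios.items():
    print(f"{name}: pp512 x{tests['pp512']:.2f}, tg128 x{tests['tg128']:.2f}")
```

On these numbers the difference is mostly in token generation: ROCm with row split is about 2.6 times the Vulkan tg128 figure, while pp512 is roughly level.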

2 replies

cb88 Jan 18, 2025

What is the power profile for this MI25? Mine is 110W but it's running slower than yours on git from today.

8XXD8 Jan 21, 2025

Mine defaults to 220w.
You can increase the power with rocm-smi --setpoweroverdrive 220

daniandtheweb · 2025-01-12T01:48:51Z

daniandtheweb
Jan 12, 2025

AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
build: c05e8c9 (4462)

Vulkan:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	439.42 ± 0.28
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	70.13 ± 0.05

HIP:

  Device 0: AMD Radeon RX 5700 XT, compute capability 10.1, VMM: no

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	pp512	354.17 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	tg128	67.55 ± 0.04

I also think it could be interesting to add the flash attention results to the scoreboard (even if support for it still isn't as mature as CUDA's).

Vulkan FA:

ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	214.48 ± 2.31
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	23.21 ± 0.08

HIP FA:

  Device 0: AMD Radeon RX 5700 XT, compute capability 10.1, VMM: no

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp512	314.17 ± 0.29
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg128	62.02 ± 0.05

2 replies

0cc4m Jan 12, 2025
Collaborator

There is no Vulkan flash attention support (except with coopmat2 on very new nvidia drivers). What you're measuring here is a CPU fallback.

daniandtheweb Jan 12, 2025

I see, I was sure about the CPU fallback but didn't know there was no flash attention support at all.

FNsi · 2025-01-12T06:17:07Z

FNsi
Jan 12, 2025

I tried, but there was no output after 1 hour (OK, it might have been 40 minutes)...

Anyway, I ran llama-cli for a sample eval...

build: 4419 (46e3556e)

./llama-cli -m ~/storage/llama-2-7b.Q4_0.gguf -p "can u" -ngl 100
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G57 (Mali-G57) | uma: 1 | fp16: 1 | warp size: 16 | matrix cores: none
build: 4419 (46e3556e) with clang version 19.1.6 for aarch64-unknown-linux-android24

llama_perf_sampler_print:    sampling time =       3.31 ms /    24 runs   (    0.14 ms per token,  7242.00 tokens per second)
llama_perf_context_print:        load time =   28544.85 ms
llama_perf_context_print: prompt eval time =    3788.63 ms /     3 tokens ( 1262.88 ms per token,     0.79 tokens per second)
llama_perf_context_print:        eval time =   23248.44 ms /    20 runs   ( 1162.42 ms per token,     0.86 tokens per second)
llama_perf_context_print:       total time =   27591.65 ms /    23 tokens

Meanwhile OpenBLAS

llama_perf_sampler_print:    sampling time =       5.00 ms /    43 runs   (    0.12 ms per token,  8608.61 tokens per second)
llama_perf_context_print:        load time =   10871.74 ms
llama_perf_context_print: prompt eval time =    1228.38 ms /     3 tokens (  409.46 ms per token,     2.44 tokens per second)
llama_perf_context_print:        eval time =   17010.39 ms /    39 runs   (  436.16 ms per token,     2.29 tokens per second)
llama_perf_context_print:       total time =   18639.62 ms /    42 tokens

2 replies

netrunnereve Jan 12, 2025
Collaborator Author

Even at below 1t/s llama-bench shouldn't run for an hour. The support just isn't there atm for Vulkan on Android.

FNsi Jan 13, 2025

Truth is ...

(0.79 tokens per second),

3788.63 ms / 3 tokens

So it's not even... it's just slower...
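
For reference, the tokens-per-second figures in those logs follow directly from the reported times. A small sketch of the arithmetic (the function name is hypothetical, the numbers are copied from the llama_perf_context_print lines earlier in this sub-thread):

```python
def tokens_per_second(total_ms: float, tokens: int) -> float:
    """Convert a timing given as total milliseconds and a token count to t/s."""
    return tokens / (total_ms / 1000.0)


# Vulkan run on the Mali-G57
vulkan_prompt = tokens_per_second(3788.63, 3)
vulkan_eval = tokens_per_second(23248.44, 20)

# OpenBLAS run on the same device
openblas_prompt = tokens_per_second(1228.38, 3)
openblas_eval = tokens_per_second(17010.39, 39)

print(f"Vulkan:   prompt {vulkan_prompt:.2f} t/s, eval {vulkan_eval:.2f} t/s")
print(f"OpenBLAS: prompt {openblas_prompt:.2f} t/s, eval {openblas_eval:.2f} t/s")
```

With only 3 prompt tokens the prompt figure is a very small sample, so the eval rate is probably the more meaningful number to compare.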

ygafarov · 2026-04-24T20:43:34Z

ygafarov
Apr 24, 2026

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	Vulkan0	pp512	7567.94 ± 41.96
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	Vulkan0	tg128	176.18 ± 1.12

build: 0d0764d (8892)

0 replies

dotdashnotdotsoftware · 2026-04-25T09:24:59Z

dotdashnotdotsoftware
Apr 25, 2026

Framework 13 running AMD Ryzen 5 7640U w/ Radeon 760M Graphics

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	187.02 ± 1.12
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	17.69 ± 0.65
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	219.24 ± 3.57
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	18.89 ± 0.35

build: 187a456 (8910)

2 replies

pt13762104 Apr 25, 2026

"int dot = 0" and "KHR_coopmat"? Weird... Maybe your SDK is not updated? But the performance looked fine?

dotdashnotdotsoftware Apr 25, 2026

@pt13762104 - I've been toying quite a bit recently with lots of random installs, llama.cpp flags, etc. to see what can be done locally even if it's slow, and I've hit weird cases (e.g. the ROCm kernel drivers are hard-locked at v6 while the userspace is v7 when using the amdgpu-install script), so I'm not really sure how nicely it (I assume you mean the Vulkan SDK) will play on this machine.

If you've any requests for more info - happy to provide them:

==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.275

sonic74 · 2026-04-28T16:10:40Z

sonic74
Apr 28, 2026

llama-bench.exe -m .lmstudio\models\TheBloke\Llama-2-7B-GGUF\llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
load_backend: loaded RPC backend from \\win11\F\Download\AI\llama-b8918-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 980M (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from \\win11\F\Download\AI\llama-b8918-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from \\win11\F\Download\AI\llama-b8918-bin-win-vulkan-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  0 |           pp512 |         52.66 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  0 |           tg128 |         24.11 ± 0.11 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |           pp512 |         53.68 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |           tg128 |         24.63 ± 0.09 |

build: e583f3b4f (8918)

sudo RADV_PERFTEST=nogttspill llama-b8918/llama-bench -m .lmstudio/models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
load_backend: loaded RPC backend from /home/sonic/llama-b8918/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 570 Series (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/sonic/llama-b8918/libggml-vulkan.so
load_backend: loaded CPU backend from /home/sonic/llama-b8918/libggml-cpu-haswell.so
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  0 |           pp512 |        226.11 ± 0.45 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  0 |           tg128 |         37.44 ± 0.03 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |           pp512 |        229.23 ± 0.45 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |           tg128 |         40.18 ± 0.00 |

build: e583f3b4f (8918)
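
If you want to compare runs like these without retyping the numbers, a pipe-delimited llama-bench table can be parsed with a few lines of Python. This is only a sketch: `parse_bench_table` and `split_rate` are hypothetical helpers, and the sample text is two rows taken from the RX 570 table above.

```python
import re

TABLE = """\
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  0 |           pp512 |        226.11 ± 0.45 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |           tg128 |         40.18 ± 0.00 |
"""


def parse_bench_table(text: str) -> list[dict[str, str]]:
    """Turn a pipe-delimited llama-bench table into a list of row dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue  # skip the |---|---:| separator row
        rows.append(dict(zip(header, cells)))
    return rows


def split_rate(cell: str) -> tuple[float, float]:
    """Split a t/s cell such as '226.11 ± 0.45' into (mean, deviation)."""
    mean, deviation = cell.split("±")
    return float(mean), float(deviation)


rows = parse_bench_table(TABLE)
for row in rows:
    mean, deviation = split_rate(row["t/s"])
    print(row["test"], f"fa={row['fa']}", mean, deviation)
```

The tab-separated tables elsewhere in this thread would need a different split, which is left out here.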

6 replies

sonic74 May 9, 2026

I tried a newer build and set Windows 10 to "Best performance", but same result:

llama-bench.exe -m .lmstudio/models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
load_backend: loaded RPC backend from \\win11\F\Download\AI\llama-b9062-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 980M (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from \\win11\F\Download\AI\llama-b9062-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from \\win11\F\Download\AI\llama-b9062-bin-win-vulkan-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  0 |           pp512 |         52.62 ± 0.01 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  0 |           tg128 |         23.94 ± 0.15 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |           pp512 |         53.69 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |           tg128 |         24.60 ± 0.06 |

build: 093be624c (9062)

GPU power draw goes over 100 W and the clock goes up to 1127 MHz while the test runs, with no PerfCap during pp.
I also tried CUDA 12.4 and 13.1 .exes, but they don't see the GPU - though LM Studio sees it.
Driver: latest 582.28
no ReBAR

netrunnereve May 11, 2026
Collaborator Author

Yeah that's odd, I don't know what the problem is. Is it like this with a smaller model as well?

sonic74 May 11, 2026

Seems so (8 GB VRAM):

llama-bench.exe -m .lmstudio/models/lmstudio-community/NVIDIA-Nemotron-3-Nano-4B-GGUF/NVIDIA-Nemotron-3-Nano-4B-Q4_K_M.gguf -ngl 100 -fa 0,1
load_backend: loaded RPC backend from \\win11\f\Download\ai\llama-b9062-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 980M (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from \\win11\f\Download\ai\llama-b9062-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from \\win11\f\Download\ai\llama-b9062-bin-win-vulkan-x64\ggml-cpu-haswell.dll
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h ?B Q4_K - Medium    |   2.63 GiB |     3.97 B | Vulkan     | 100 |  0 |           pp512 |         81.11 ± 0.01 |
| nemotron_h ?B Q4_K - Medium    |   2.63 GiB |     3.97 B | Vulkan     | 100 |  0 |           tg128 |         29.55 ± 0.11 |
| nemotron_h ?B Q4_K - Medium    |   2.63 GiB |     3.97 B | Vulkan     | 100 |  1 |           pp512 |         81.45 ± 0.01 |
| nemotron_h ?B Q4_K - Medium    |   2.63 GiB |     3.97 B | Vulkan     | 100 |  1 |           tg128 |         29.44 ± 0.08 |

build: 093be624c (9062)

FNsi May 11, 2026

Check your PCIe connection then?

sonic74 May 11, 2026

"\\win11\C\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\extras\demo_suite\bandwidthTest.exe"
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce GTX 980M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10202.3

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     9620.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     132036.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

ikmalsaid · 2026-04-30T13:09:53Z

ikmalsaid
Apr 30, 2026

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3060 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

.\llama-bench.exe -m "D:\llama-2-7b.Q4_0.gguf" -ngl 100 -fa 0

| model                          |       size |     params | backend    | ngl | fa |           test |                 t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | -------------: | ------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  0 |          pp512 |      2361.43 ± 4.32 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  0 |          tg128 |        86.74 ± 0.34 |

build: b42c7fa5b (8982)

.\llama-bench.exe -m "D:\llama-2-7b.Q4_0.gguf" -ngl 100 -fa 1

| model                          |       size |     params | backend    | ngl | fa |           test |                 t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | -------------: | ------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |          pp512 |     2644.95 ± 26.82 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |  1 |          tg128 |        90.35 ± 1.00 |

build: b42c7fa5b (8982)

0 replies

ygafarov · 2026-05-01T16:12:37Z

ygafarov
May 1, 2026

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	267.27 ± 7.61
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	19.81 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	311.02 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	21.28 ± 0.01

build: 1e9d771 (8768)

0 replies

digitalscream · 2026-05-02T09:32:35Z

digitalscream
May 2, 2026

Yep, I'll play...big uplift in PP compared with the R9700 in the leaderboard, small increase in TG.

RADV_PERFTEST=nogttspill ./llama-bench -m /opt/working2/llm/models.1/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none --device Vulkan1
load_backend: loaded RPC backend from /main_data/bin/llama.cpp.vulkan/llama-b8963/libggml-rpc.so
WARNING: radv is not a conformant Vulkan implementation, testing use only.
WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon Graphics (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /main_data/bin/llama.cpp.vulkan/llama-b8963/libggml-vulkan.so
load_backend: loaded CPU backend from /main_data/bin/llama.cpp.vulkan/llama-b8963/libggml-cpu-haswell.so
| model                          |       size |     params | backend    | ngl |     sm | fa | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | ------------ | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |   none |  0 | Vulkan1      |           pp512 |      5281.22 ± 48.93 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |   none |  0 | Vulkan1      |           tg128 |        133.27 ± 0.33 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |   none |  1 | Vulkan1      |           pp512 |      5620.02 ± 60.27 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |   none |  1 | Vulkan1      |           tg128 |        138.79 ± 0.10 |

build: f9f33654a (8963)

0 replies

ygafarov · 2026-05-03T13:38:35Z

ygafarov
May 3, 2026

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	Vulkan0	pp512	4666.15 ± 12.89
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	Vulkan0	tg128	164.05 ± 0.89
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	Vulkan0	pp512	5200.68 ± 16.44
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	Vulkan0	tg128	171.61 ± 2.69

build: d05fe1d (9010)

1 reply

kha84 May 25, 2026

Where are the 8060 results? You only provided the 3090.

xxxajk · 2026-05-03T14:31:06Z

xxxajk
May 3, 2026

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	Vulkan0	pp512	499.58 ± 0.21
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	Vulkan0	tg128	73.43 ± 0.08

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	Vulkan0	pp512	498.99 ± 0.93
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	Vulkan0	tg128	73.42 ± 0.03

build: 7b8443a (8966)

6 replies

xxxajk May 22, 2026

Yes, it does! WITH the p40 at the same time.

netrunnereve May 24, 2026
Collaborator Author

Okay, I have no idea what you mean there: if the dev is Vulkan0, it means that only Vulkan0 is being used. If there's no device set, or if you have dev as Vulkan0/Vulkan1, then you're using both cards with split mode.

xxxajk May 26, 2026

I mean that when I use an actual session, both GPUs combine under Vulkan.
And no, I'm not going to pop the cards in and out constantly.
If the report is broken, then the tool must be broken. Those are the results I get if I tell it one or the other. It combines them.

netrunnereve May 27, 2026
Collaborator Author

Oh nothing's broken. If you have multiple GPUs the model will automatically run on all of them using layer splitting unless you change the settings.

Like in my case:

ggml_vulkan: 0 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

default settings, runs on both cards with layer splitting

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	198.66 ± 0.83
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	38.96 ± 0.12

-sm layer -mg 0, which is the same as the default

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	198.30 ± 0.20
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	38.69 ± 0.15

-sm none -mg 0, runs on the first card only

model	size	params	backend	ngl	sm	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	none	pp512	207.79 ± 0.90
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	none	tg128	39.75 ± 0.05

-dev Vulkan0, which is the same as -sm none -mg 0

model	size	params	backend	ngl	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	Vulkan0	pp512	206.61 ± 0.77
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	Vulkan0	tg128	39.85 ± 0.07
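
The four configurations above correspond to the following command lines. The sketch below just assembles them as strings, using only the flags mentioned in this reply. The model path and the `bench_command` helper are placeholders of mine, so adjust them to your setup.

```python
import shlex

MODEL = "llama-2-7b.Q4_0.gguf"  # placeholder path


def bench_command(*extra: str) -> str:
    """Build a llama-bench command line with optional device-selection flags."""
    return shlex.join(["./llama-bench", "-m", MODEL, "-ngl", "100", *extra])


commands = {
    "default (layer split across all GPUs)": bench_command(),
    "explicit layer split": bench_command("-sm", "layer", "-mg", "0"),
    "first card only": bench_command("-sm", "none", "-mg", "0"),
    "first card only, by device name": bench_command("-dev", "Vulkan0"),
}
for label, command in commands.items():
    print(f"{label}:\n  {command}")
```

For the scoreboard, one of the two single-card variants is the one to use, in line with the instructions at the top of the thread.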

xxxajk May 27, 2026

That explains it then. :-) Thanks!
Well, at least you get to see that it does combine them, which is a good data point in any case.

thordarsen · 2026-05-30T21:57:07Z

thordarsen
May 30, 2026

Interestingly I get wildly different scores between Windows and Linux on an Arc Pro B50

on Linux (Ubuntu 25.10) intel i7-8700
2f6c815 (9397)

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	590.01 ± 0.93
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	40.13 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	581.78 ± 1.61
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	41.82 ± 0.08

in Windows 11 - Ryzen 5700X

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	pp512	1186.45 ± 4.43
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	tg128	46.30 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	pp512	1122.53 ± 3.73
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	tg128	47.69 ± 0.06

build: aa46bda (9437)

the testbeds were a little different - but my SYCL results were consistent between the platforms so ¿?

2 replies

0cc4m May 31, 2026
Collaborator

That's a known issue, because of different drivers. Intel puts a lot more work into the Windows driver optimization, sadly.

savvadesogle May 31, 2026

Ubuntu 25.10 (7.0.6-070006-generic)
XE driver (intel mesa v: 26.0.5)
Vulkan v: 1.4.321

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	1209.43 ± 1.62
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	39.13 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	1243.46 ± 2.55
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	40.62 ± 0.04

build: 22cadc1 (9439)

konradmb · 2026-06-01T22:59:06Z

konradmb
Jun 1, 2026

Vega 64 (undervolted)

Sapphire RX Vega 64 (reference blower-fan type)
This is what a proper undervolt should look like.

Result

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	649.29 ± 0.70
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	74.19 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	645.93 ± 0.49
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	69.00 ± 0.09

build: 5dcb711 (9464)

LACT profile:

[...]
profiles:
  Undervolt lowest MAX perf:
    gpus:
      1002:687F-1002:6B76-0000:2f:00.0:
        fan_control_enabled: true
        fan_control_settings:
          mode: static
          static_speed: 1.0
          temperature_key: edge
          interval_ms: 500
          curve:
            54: 0.0
            55: 0.05
            60: 0.2
            70: 0.7
            80: 0.9
            90: 1.0
          spindown_delay_ms: 10000
          change_threshold: 3
        power_cap: 330.0
        performance_level: manual
        gpu_vf_curve:
          5:
            voltage: 825
            clockspeed: 1400
          2:
            voltage: 803
          6:
            voltage: 850
            clockspeed: 1520
          7:
            voltage: 1050
            clockspeed: 1580
          4:
            voltage: 825
          1:
            voltage: 802
          0:
            voltage: 801
          3:
            voltage: 825
        mem_vf_curve:
          3:
            voltage: 980
            clockspeed: 1050
        power_profile_mode_index: 5
        power_states:
          core_clock:
          - 7
          memory_clock:
          - 3
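
Since the `gpu_vf_curve` states in the profile above are listed out of order, it can help to sort them before reading the curve. A minimal Python sketch with the values copied by hand from the profile; the millivolt and MHz units are an assumption (the profile does not state them), and the monotonicity check is just one plausible sanity test, not something LACT itself does:

```python
# gpu_vf_curve copied from the profile above: state -> (voltage, clockspeed or None).
# Units are assumed to be mV and MHz; the profile itself does not say.
gpu_vf_curve = {
    5: (825, 1400), 2: (803, None), 6: (850, 1520), 7: (1050, 1580),
    4: (825, None), 1: (802, None), 0: (801, None), 3: (825, None),
}

states = sorted(gpu_vf_curve)
voltages = [gpu_vf_curve[s][0] for s in states]
clocks = [gpu_vf_curve[s][1] for s in states if gpu_vf_curve[s][1] is not None]

# Check that voltage never drops, and clock strictly rises, with the state index.
voltage_ok = all(a <= b for a, b in zip(voltages, voltages[1:]))
clock_ok = all(a < b for a, b in zip(clocks, clocks[1:]))

for s in states:
    v, c = gpu_vf_curve[s]
    print(f"state {s}: {v} mV" + (f" @ {c} MHz" if c is not None else ""))
print("monotonic:", voltage_ok and clock_ok)
```

Sorted this way, the top state reads as 1050 at 1580, with the lower states stepping down from there.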

Vega 64 (default profile)

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	584.98 ± 1.12
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	67.70 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	575.06 ± 1.03
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	63.43 ± 0.07

build: 5dcb711 (9464)
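
For a quick read of the two Vega 64 tables above, the relative gain from the undervolt can be worked out directly from the posted means. A minimal Python sketch; the `gain_pct` helper and the dictionary layout are only illustrative, and the error bars are ignored:

```python
# Mean t/s copied from the two Vega 64 tables above (build 5dcb711).
# Keys are (fa, test); the ± values are ignored here.
undervolted = {(1, "pp512"): 649.29, (1, "tg128"): 74.19,
               (0, "pp512"): 645.93, (0, "tg128"): 69.00}
default = {(1, "pp512"): 584.98, (1, "tg128"): 67.70,
           (0, "pp512"): 575.06, (0, "tg128"): 63.43}

def gain_pct(new, old):
    """Relative change of new over old, in percent."""
    return (new / old - 1.0) * 100.0

gains = {key: gain_pct(undervolted[key], default[key]) for key in default}
for (fa, test), pct in sorted(gains.items()):
    print(f"fa={fa} {test}: {pct:+.1f}%")
```

With these numbers the undervolted profile comes out roughly 9 to 12 percent ahead of the default profile on all four rows.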

0 replies

sswtodo · 2026-06-03T19:57:38Z

sswtodo
Jun 3, 2026

$> uname -a
FreeBSD devmach 16.0-CURRENT FreeBSD 16.0-CURRENT #0 main-n286323-553ef188f7ec: Tue Jun 2 06:42:39 CEST 2026

$> llama-bench -fa 0,1 -ngl 666 -m /zfast/llm/coding/llama-2-7b.Q4_0.gguf

model	size	params	backend	threads	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	16	0	pp512	2786.04 ± 24.56
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	16	0	tg128	151.95 ± 0.44
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	16	1	pp512	3699.24 ± 17.47
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan,BLAS	16	1	tg128	160.37 ± 0.46

build: a731805 (9493)

Removing BLAS (OpenBLAS) does not decrease results.

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	666	0	pp512	3029.13 ± 171.15
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	666	0	tg128	152.14 ± 0.65
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	666	1	pp512	3701.57 ± 9.37
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	666	1	tg128	160.24 ± 0.92

build: ad1b88c (9525)

Gentoo:

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	pp512	3404.48 ± 25.66
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	tg128	174.66 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	pp512	3493.72 ± 10.44
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	tg128	178.62 ± 0.20

build: 65ef50a (9501)
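
To compare runs like the FreeBSD and Gentoo ones above without retyping numbers, the pasted rows can be parsed mechanically. A rough Python sketch, assuming the rows are tab-separated as they appear in this thread and that the last column has the form `mean ± stddev`; `parse_row` and `by_key` are hypothetical helpers, not part of llama.cpp, and the comparison only means something if both runs used the same GPU, which the post does not state:

```python
def parse_row(line):
    """Split one pasted llama-bench row into (fa, test, mean, stddev)."""
    cols = line.rstrip("\n").split("\t")
    # The last three columns are fa, test and "mean ± stddev" in the tables here.
    fa, test, tps = cols[-3], cols[-2], cols[-1]
    mean, stddev = (float(x) for x in tps.split("±"))
    return int(fa), test, mean, stddev

# fa=0 rows copied from the Vulkan-only FreeBSD run and the Gentoo run above.
freebsd_rows = [
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tVulkan\t666\t0\tpp512\t3029.13 ± 171.15",
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tVulkan\t666\t0\ttg128\t152.14 ± 0.65",
]
gentoo_rows = [
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tVulkan\t-1\t0\tpp512\t3404.48 ± 25.66",
    "llama 7B Q4_0\t3.56 GiB\t6.74 B\tVulkan\t-1\t0\ttg128\t174.66 ± 0.08",
]

def by_key(rows):
    """Map (fa, test) to the mean t/s of each parsed row."""
    return {(fa, test): mean for fa, test, mean, _ in map(parse_row, rows)}

freebsd, gentoo = by_key(freebsd_rows), by_key(gentoo_rows)
ratios = {key: gentoo[key] / freebsd[key] for key in freebsd}
for (fa, test), r in ratios.items():
    print(f"fa={fa} {test}: Gentoo/FreeBSD = {r:.2f}")
```

Note that the FreeBSD pp512 row has a large spread (± 171.15), so the pp512 ratio is less reliable than the tg128 one.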

2 replies

sswtodo Jun 12, 2026

Your link points to this thread...

jruhe-adesso Jun 13, 2026

My mistake, sorry.

netrunnereve · 2026-06-15T01:44:37Z

netrunnereve Jun 15, 2026
Collaborator Author

Hey, in the future please don't post standalone CPU results as a top-level reply, as it might confuse someone. I've hidden it for now.

twoplan · 2026-06-14T12:11:27Z

twoplan
Jun 14, 2026

Hi,

Intel Core Ultra 7 258V
Ubuntu 26.04
Vulkan binary from download release B9630

./llama-bench -fa 0,1 -m ../../models/llama-2-7b.Q4_0.gguf

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	pp512	529.31 ± 7.92
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	0	tg128	20.39 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	pp512	504.02 ± 7.68
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	-1	1	tg128	21.07 ± 0.09

build: 8ed274e (9630)

Also, with the OpenVINO backend:

OpenVINO: using device GPU

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	OPENVINO	-1	1	pp512	1306.05 ± 1.06
llama 7B Q4_0	3.56 GiB	6.74 B	OPENVINO	-1	1	tg128	18.13 ± 0.10

build: 8ed274e (9630)

SYCL bench results

1 reply

pt13762104 Jun 14, 2026

That's about as fast as a 2060, which isn't slow for a mobile GPU like this... I wonder what the new PTL GPU can do.

Surprisingly, Vulkan and SYCL performed roughly the same, and both are about 2.2x slower than OpenVINO...
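
Using only the Vulkan and OpenVINO rows posted above for the Core Ultra 7 258V (the SYCL numbers are behind the link and are not reproduced here), the gap can be checked with a few lines of Python. This is a sketch over the posted means; the variable names are illustrative and the error bars are ignored:

```python
# Mean t/s copied from the Core Ultra 7 258V tables above (build 8ed274e).
vulkan = {(0, "pp512"): 529.31, (0, "tg128"): 20.39,
          (1, "pp512"): 504.02, (1, "tg128"): 21.07}
openvino = {"pp512": 1306.05, "tg128": 18.13}  # posted with fa=1 only

# Ratio > 1 means OpenVINO is faster than Vulkan for that row.
ratios = {(fa, test): openvino[test] / tps for (fa, test), tps in vulkan.items()}
for (fa, test), r in sorted(ratios.items()):
    print(f"Vulkan fa={fa} {test}: OpenVINO/Vulkan = {r:.2f}")
```

On these rows OpenVINO is roughly 2.5x faster at prompt processing, while Vulkan is slightly ahead at token generation.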

aivisionslab-studios · 2026-06-21T00:11:30Z

aivisionslab-studios
Jun 21, 2026

AMD Radeon RX 580 2048SP (mining edition, 8GB) — Windows, AMD proprietary driver

ggml_vulkan: 0 = AMD Radeon RX 580 2048SP (AMD proprietary driver) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | int dot: 0 | matrix cores: none
build: 9a532ae (9222)

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	96.83 ± 0.78
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	17.00 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	102.86 ± 0.22
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	18.92 ± 0.07

Noticeably lower than your RX 470 mining edition above (same 2048SP silicon). That is almost certainly the Windows proprietary driver vs RADV on Linux; I've seen close to 2x gaps between the two on this exact card on other workloads. Will follow up with a Linux/RADV run on the same hardware for a direct same-chip comparison.

2 replies

netrunnereve Jun 21, 2026
Collaborator Author

Yeah, the Windows driver for Polaris is too old and unsupported. There are also a couple of bugs with it and llama.cpp which haven't been fixed. For those old cards you should be using Linux and Mesa.

aivisionslab-studios Jun 21, 2026

That tracks with what we've seen — appreciate the confirmation. Will get a Linux/Mesa RADV run on this same card up soon for a direct comparison, then.



Replies: 285 comments · 460 replies
