tune Qwen3-VL-4B prefill unified-attention on gfx1150 by qingxuamd · Pull Request #1024 · ROCm/vllm

qingxuamd · 2026-06-26T06:41:58Z

Tune Qwen3-VL-4B prefill unified attention. TTFT improves from 959 ms to 921 ms (about 4%), verified on gfx1150.

This optimization targets gfx1150. For the Qwen3-VL-4B-Instruct-AWQ-4bit-lm_head_int8 model, triton_w4a16_skinny_fmt_kernel accounts for more than 60% of prefill latency.

Required input: 2 images of 448x448 plus 256 text tokens, so about 660 tokens in total.

M: ~660 (2 x 448x448 images + 256-token prefill)
Hot shapes (M x N x K):
  660 x 19456 x 2560
  660 x 2560 x 9728
  660 x 6144 x 2560
  660 x 2560 x 4096

num_warps: 8 is currently the best value, so:
  BLOCK_M=64 BLOCK_N=256 BLOCK_K=64 num_warps=8

GEMM: M=660, N/K as above
Kernel tile (BLOCK_M, BLOCK_N, BLOCK_K, num_warps): 64, 256, 64, 8

Signed-off-by: Xu Qing <qing.xu2@amd.com>
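To make the tile choice in the commit message above easier to reason about, here is a minimal sketch of the launch-grid arithmetic for the four hot shapes. It assumes each shape is written as M x N x K and that the kernel launches one program per BLOCK_M x BLOCK_N output tile; the helper name `tile_grid` is hypothetical and is not part of the patch.

```python
# Sketch only: grid arithmetic for the tile sizes quoted in the commit message.
# Assumption: shapes are (M, N, K) and one program handles one BLOCK_M x BLOCK_N tile.

BLOCK_M, BLOCK_N, BLOCK_K = 64, 256, 64

HOT_SHAPES = [
    (660, 19456, 2560),
    (660, 2560, 9728),
    (660, 6144, 2560),
    (660, 2560, 4096),
]


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def tile_grid(m: int, n: int, k: int):
    """Return (tiles along M, tiles along N, K-loop steps, row utilization along M)."""
    tiles_m = ceil_div(m, BLOCK_M)
    tiles_n = ceil_div(n, BLOCK_N)
    k_steps = ceil_div(k, BLOCK_K)
    # Fraction of rows in the padded M dimension that hold real tokens.
    m_utilization = m / (tiles_m * BLOCK_M)
    return tiles_m, tiles_n, k_steps, m_utilization


if __name__ == "__main__":
    for m, n, k in HOT_SHAPES:
        tm, tn, ks, util = tile_grid(m, n, k)
        print(f"{m}x{n}x{k}: grid={tm}x{tn} ({tm * tn} programs), "
              f"k_steps={ks}, M utilization={util:.4f}")
```

With M=660 and BLOCK_M=64, the M dimension is padded to 11 tiles (704 rows), so 93.75% of the rows in each tile column carry real tokens. N and K divide evenly by BLOCK_N and BLOCK_K for all four shapes.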

Add a Qwen3-VL-4B prefill shape guard in Triton unified attention on gfx11 and apply BM64/T32/W4/S1/EU4 defaults to reduce TTFT. Also fix the prefill tile override wiring so that VLLM_UA_PREFILL_TILE_SIZE is honored instead of being overwritten by the default path.

With the patch, TTFT improves from 959 ms to 921 ms, verified on gfx1150.

Signed-off-by: Xu Qing <qing.xu2@amd.com>
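The override fix described in this commit message can be illustrated with a small sketch of the intended precedence: an explicit VLLM_UA_PREFILL_TILE_SIZE wins, and the tuned default applies only when the variable is unset. This is not the code from the patch. The function name `resolve_prefill_tile_size` is hypothetical, and reading "T32" as a default tile size of 32 is an assumption.

```python
import os
from typing import Mapping, Optional

# Assumption: "T32" in BM64/T32/W4/S1/EU4 denotes a prefill tile size of 32.
DEFAULT_PREFILL_TILE_SIZE = 32


def resolve_prefill_tile_size(
    default: int = DEFAULT_PREFILL_TILE_SIZE,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Hypothetical helper: pick the prefill tile size, letting the env var win."""
    env = os.environ if env is None else env
    raw = env.get("VLLM_UA_PREFILL_TILE_SIZE")
    if raw is None or raw.strip() == "":
        # No override requested: fall back to the tuned default.
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError("VLLM_UA_PREFILL_TILE_SIZE must be a positive integer")
    # Return the override directly so a later default path cannot overwrite it.
    return value


if __name__ == "__main__":
    print(resolve_prefill_tile_size(env={}))
    print(resolve_prefill_tile_size(env={"VLLM_UA_PREFILL_TILE_SIZE": "64"}))
```

Resolving the value in one place and returning early is one way to avoid the bug class the commit describes, where a default assigned later silently replaces the user's setting.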

qingxuamd added 2 commits June 24, 2026 16:21

qingxuamd requested a review from dllehr-amd as a code owner June 26, 2026 06:41

qingxuamd wants to merge 2 commits into gfx11 from qingxu/qwen3-vl-optimize3

qingxuamd commented Jun 26, 2026

1 participant
