tune Qwen3-VL-4B prefill unified-attention on gfx1150 #1024

Open
qingxuamd wants to merge 2 commits into gfx11 from qingxu/qwen3-vl-optimize3
@qingxuamd


Tune Qwen3-VL-4B prefill unified-attention; TTFT improves from 959 ms to 921 ms, verified on gfx1150

This optimization targets gfx1150.
For the model Qwen3-VL-4B-Instruct-AWQ-4bit-lm_head_int8,
triton_w4a16_skinny_fmt_kernel accounts for more than 60% of the
prefill latency. The required input is two 448x448 images plus
256 text tokens, which comes to roughly 660 tokens.
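
As a rough cross-check of the ~660 figure, the token count can be estimated from the image size. This is only a sketch: the patch size of 16 and the 2x2 spatial merge are assumptions about the Qwen3-VL vision encoder, not values stated in this PR, and `vision_tokens` is a hypothetical helper.

```python
def vision_tokens(height, width, patch=16, merge=2):
    """Tokens one image contributes after patchify + spatial merge.

    patch and merge are ASSUMED values, not taken from this PR.
    """
    block = patch * merge
    if height % block or width % block:
        raise ValueError("image size must be a multiple of patch * merge")
    patches = (height // patch) * (width // patch)
    return patches // (merge * merge)


image_tokens = 2 * vision_tokens(448, 448)  # two 448x448 images
text_tokens = 256
total = image_tokens + text_tokens
print(image_tokens, text_tokens, total)
```

Under these assumptions the images and text account for 648 tokens; the remaining dozen or so up to ~660 would presumably come from chat-template and vision delimiter tokens, which this sketch does not model.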

M: ~660 (2 x 448x448 images + 256 tokens, prefill)
Hot shapes (M x N x K):
660 x 19456 x 2560
660 x 2560 x 9728
660 x 6144 x 2560
660 x 2560 x 4096
num_warps: 8 is currently the best value

So the kernel configuration is:
BLOCK_M=64
BLOCK_N=256
BLOCK_K=64
num_warps=8

GEMM: M=660, N/K as above
kernel tile (BLOCK_M, BLOCK_N, BLOCK_K, num_warps): 64, 256, 64, 8
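
To see what this tile choice means for the hot shapes, the sketch below counts output tiles and K-loop steps per GEMM. It assumes the usual Triton GEMM layout (one program per BLOCK_M x BLOCK_N output tile, looping over K in BLOCK_K steps); the actual launch grid of triton_w4a16_skinny_fmt_kernel is not shown in this PR, and `tile_stats` is a hypothetical helper.

```python
BLOCK_M, BLOCK_N, BLOCK_K = 64, 256, 64
M = 660
HOT_SHAPES = [(19456, 2560), (2560, 9728), (6144, 2560), (2560, 4096)]  # (N, K)


def cdiv(a, b):
    """Ceiling division on positive integers."""
    return -(-a // b)


def tile_stats(m, n, k):
    """Return (output tiles, K-loop steps, fraction of M rows that are real data)."""
    tiles_m = cdiv(m, BLOCK_M)
    tiles_n = cdiv(n, BLOCK_N)
    k_steps = cdiv(k, BLOCK_K)
    m_utilization = m / (tiles_m * BLOCK_M)
    return tiles_m * tiles_n, k_steps, m_utilization


for n, k in HOT_SHAPES:
    tiles, k_steps, util = tile_stats(M, n, k)
    print(f"{M} x {n} x {k}: tiles={tiles}, k_steps={k_steps}, M utilization={util:.4f}")
```

With M=660 and BLOCK_M=64 there are 11 row tiles covering 704 rows, so about 6% of the rows in the last tile are padding; all four N values divide evenly by BLOCK_N=256.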

Signed-off-by: Xu Qing <qing.xu2@amd.com>
Add a Qwen3-VL-4B prefill shape guard in Triton unified attention on gfx11
and apply BM64/T32/W4/S1/EU4 defaults to reduce TTFT. Also fix prefill
tile override wiring so VLLM_UA_PREFILL_TILE_SIZE is honored instead of
being overwritten by the default path.
With the patch, TTFT improves from 959 ms to 921 ms (about 4%), verified on gfx1150.
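
The override fix described above amounts to resolving the tile size once, with the environment variable taking precedence over the default path. A minimal sketch of that precedence follows; only the VLLM_UA_PREFILL_TILE_SIZE name comes from this PR, while `resolve_prefill_tile_size`, its parameters, and the default value of 32 are hypothetical and do not reflect the actual vLLM code.

```python
import os


def resolve_prefill_tile_size(default_tile_size, environ=None):
    """Return the prefill tile size, letting the env var win over the default.

    Hypothetical helper: illustrates the precedence only, not vLLM's code.
    """
    env = os.environ if environ is None else environ
    raw = env.get("VLLM_UA_PREFILL_TILE_SIZE", "").strip()
    if not raw:
        # No override set: fall back to the tuned default.
        return default_tile_size
    value = int(raw)
    if value <= 0:
        raise ValueError("VLLM_UA_PREFILL_TILE_SIZE must be a positive integer")
    return value


# Illustrative default of 32 (an assumption, not a value confirmed by the PR).
print(resolve_prefill_tile_size(32, environ={}))
print(resolve_prefill_tile_size(32, environ={"VLLM_UA_PREFILL_TILE_SIZE": "64"}))
```

The point of the sketch is that the default is applied only when no override is present, so a later default-path assignment cannot silently replace the user's value.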

Signed-off-by: Xu Qing <qing.xu2@amd.com>
@qingxuamd qingxuamd requested a review from dllehr-amd as a code owner June 26, 2026 06:41