tune Qwen3-VL-4B prefill unified-attention on gfx1150 #1024
Open
qingxuamd wants to merge 2 commits into
This optimization targets gfx1150. For the Qwen3-VL-4B-Instruct-AWQ-4bit-lm_head_int8 model, triton_w4a16_skinny_fmt_kernel accounts for more than 60% of prefill latency.

The required input is two 448x448 images plus 256 text tokens, which gives roughly 660 tokens in prefill.

M: ~660 (2 x 448x448 images + 256-token prefill)
Hot shapes (M x N x K):
  660 x 19456 x 2560
  660 x 2560 x 9728
  660 x 6144 x 2560
  660 x 2560 x 4096

num_warps: 8 is currently the best value, so:
  BLOCK_M=64 BLOCK_N=256 BLOCK_K=64 num_warps=8

GEMM: M=660, N/K as above
Kernel tile: 64, 256, 64, 8

Signed-off-by: Xu Qing <qing.xu2@amd.com>
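To make the tile configuration above easier to reason about, here is a minimal sketch that counts the tiles each hot shape produces with BLOCK_M=64, BLOCK_N=256, BLOCK_K=64. It assumes the common Triton GEMM layout of one program per BLOCK_M x BLOCK_N output tile with a loop over K in BLOCK_K steps; the real triton_w4a16_skinny_fmt_kernel may launch differently (for example with split-K). The helper name `tile_grid` is hypothetical and not part of the patch.

```python
from math import ceil

# Tile sizes taken from the commit message.
BLOCK_M, BLOCK_N, BLOCK_K = 64, 256, 64

# Hot GEMM shapes (M, N, K) listed in the commit message.
HOT_SHAPES = [
    (660, 19456, 2560),
    (660, 2560, 9728),
    (660, 6144, 2560),
    (660, 2560, 4096),
]


def tile_grid(m, n, k, bm=BLOCK_M, bn=BLOCK_N, bk=BLOCK_K):
    """Hypothetical helper: count tiles for an M x N x K GEMM.

    Assumes one program per (bm x bn) output tile and a K loop in bk steps.
    """
    m_tiles = ceil(m / bm)
    n_tiles = ceil(n / bn)
    k_steps = ceil(k / bk)
    return {
        "m_tiles": m_tiles,
        "n_tiles": n_tiles,
        "k_steps": k_steps,
        "programs": m_tiles * n_tiles,
        # Fraction of rows in the M tiles that hold real tokens.
        "m_utilization": m / (m_tiles * bm),
    }


if __name__ == "__main__":
    for shape in HOT_SHAPES:
        print(shape, tile_grid(*shape))
```

With M around 660, BLOCK_M=64 needs 11 row tiles covering 704 rows, so under this assumption about 94% of the rows in each M tile column carry real tokens; the rest is padding.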
Add a Qwen3-VL-4B prefill shape guard in Triton unified attention on gfx11 and apply BM64/T32/W4/S1/EU4 defaults to reduce TTFT. Also fix the prefill tile override wiring so that VLLM_UA_PREFILL_TILE_SIZE is honored instead of being overwritten by the default path.

With the patch, TTFT improves from 959 ms to 921 ms, verified on gfx1150.

Signed-off-by: Xu Qing <qing.xu2@amd.com>
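The override fix described above comes down to precedence: an explicitly set VLLM_UA_PREFILL_TILE_SIZE should win over whatever the shape-guarded default path picks. The following is only a sketch of that ordering, not the code in the patch. The function name `resolve_prefill_tile_size` and its `env` parameter are hypothetical, and the default of 32 in the usage lines assumes that "T32" in the commit message denotes a tile size of 32.

```python
import os


def resolve_prefill_tile_size(shape_default, env=None):
    """Hypothetical helper: choose the prefill tile size.

    The environment variable is consulted first, so the default path
    can no longer overwrite a user-provided value.
    """
    env = os.environ if env is None else env
    raw = env.get("VLLM_UA_PREFILL_TILE_SIZE")
    if raw is None or raw.strip() == "":
        # No override set: fall back to the shape-guarded default.
        return shape_default
    value = int(raw)
    if value <= 0:
        raise ValueError("VLLM_UA_PREFILL_TILE_SIZE must be a positive integer")
    return value


if __name__ == "__main__":
    # Assumed default of 32 (reading "T32" as tile size 32).
    print(resolve_prefill_tile_size(32, env={}))
    print(resolve_prefill_tile_size(32, env={"VLLM_UA_PREFILL_TILE_SIZE": "64"}))
```

Passing `env` explicitly keeps the sketch testable without touching the process environment.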
Tune Qwen3-VL-4B prefill unified-attention: TTFT improves from 959 ms to 921 ms, verified on gfx1150.