Just wanted to drop a huge thank you for your incredible work on getting vLLM running on the AC922 (ppc64le)!
Compiling PyTorch and Triton from source on PowerPC is no joke, and your clever workaround to skip DeepGEMM/DeepEP for the V100's sm_70 architecture is brilliant. The Dockerfile PR you submitted is an absolute lifesaver for those of us working with POWER9 + V100 setups. It saves us from days of dependency hell and compilation errors. It's awesome to hear that vLLM is running fast on these machines!
I also wanted to share a couple of exciting findings that might complement your work and further boost V100 inference on AC922:
Flash Attention for V100: There is now a repository providing Flash Attention support specifically adapted for the V100 (sm_70): https://github.com/ai-bond/flash-attention-v100. This could be a game-changer for accelerating context processing on our GPUs.
AWQ-Q4 Hardware Acceleration: I also found a vLLM fork that has implemented hardware-accelerated AWQ (Q4) quantization support specifically for V100 GPUs: https://github.com/1CatAI/. Since VRAM capacity is often the bottleneck on V100 (as you rightly pointed out regarding the lack of unified memory in vLLM), running AWQ-Q4 with hardware acceleration might be the perfect way to fit larger models while maintaining high throughput.
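For a rough sense of why 4-bit weights matter on a 16 GB or 32 GB V100, here is a back-of-the-envelope sketch of weight storage only. The helper names, the group size of 128, and the per-group overhead (one FP16 scale plus one 4-bit zero point) are my own assumptions for illustration, not something taken from the fork, and the model sizes are just examples. KV cache and activations come on top of this.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (ignores KV cache and activations)."""
    return n_params * bits_per_weight / 8 / 2**30


def awq_bits_per_weight(group_size: int = 128) -> float:
    """Effective bits per weight for 4-bit group quantization.

    Assumption: each group of `group_size` weights stores one FP16 scale
    (16 bits) and one 4-bit zero point in addition to the 4-bit weights.
    """
    return 4 + (16 + 4) / group_size


if __name__ == "__main__":
    # Illustrative parameter counts, not measurements of specific checkpoints.
    for name, n_params in [("7B", 7e9), ("13B", 13e9), ("32B", 32e9)]:
        fp16 = weight_gib(n_params, 16)
        awq = weight_gib(n_params, awq_bits_per_weight())
        print(f"{name}: FP16 ~{fp16:.1f} GiB, AWQ 4-bit ~{awq:.1f} GiB")
```

Under these assumptions the weights shrink to roughly a quarter of their FP16 size, which is the headroom that would let a larger model fit on a single card.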
Thanks again for paving the way and sharing your hard-earned insights with the community. Your contribution is hugely appreciated!