A minimal fork of vLLM with built-in support for Qolda-AVL (Audio-Vision-Language), an extension of Qwen3-VL that adds an audio modality through a Whisper encoder, an MLP projector, and DeepStack injection at LLM layers.
The companion model checkpoint is available on Hugging Face: issai/Qolda-AVL-5B.
Changes relative to upstream vLLM:
- `vllm/model_executor/models/qwen3_avl.py`: the Qwen3-AVL model implementation (audio encoder + projection + DeepStack + MRoPE filtering for audio tokens)
- Registry entry in `vllm/model_executor/models/registry.py` registering `Qwen3AVLForConditionalGeneration`
- `librosa` and `soundfile` added to runtime requirements (`requirements/common.txt`)
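To make the "DeepStack" item above more concrete, here is a toy sketch in plain Python. It assumes that injection works as it does for vision features in Qwen3-VL: projected encoder features are added to the hidden states at the modality's token positions after each of the first few decoder layers. The names `add_at_positions` and `run_layers_with_injection` are hypothetical and do not come from `qwen3_avl.py`; the real implementation operates on tensors and may differ in detail.

```python
from typing import Callable, List, Sequence

Hidden = List[List[float]]  # shape: [num_tokens][hidden_size]


def add_at_positions(hidden: Hidden, positions: Sequence[int], feats: Hidden) -> Hidden:
    """Return a copy of `hidden` with feats[i] added to the row at positions[i]."""
    out = [row[:] for row in hidden]
    for pos, feat in zip(positions, feats):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out


def run_layers_with_injection(
    hidden: Hidden,
    layers: Sequence[Callable[[Hidden], Hidden]],
    audio_positions: Sequence[int],
    per_layer_feats: Sequence[Hidden],
) -> Hidden:
    """Run decoder layers; after layer i, inject per_layer_feats[i] if it exists.

    Assumption: only the first len(per_layer_feats) layers receive an injection.
    """
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if idx < len(per_layer_feats):
            hidden = add_at_positions(hidden, audio_positions, per_layer_feats[idx])
    return hidden


def identity(h: Hidden) -> Hidden:
    """Stand-in for a decoder layer, so the injection is easy to see."""
    return h


# Toy run: 3 tokens, hidden size 2, token 1 is the audio placeholder.
toy_out = run_layers_with_injection(
    hidden=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    layers=[identity, identity, identity],
    audio_positions=[1],
    per_layer_feats=[[[1.0, 2.0]], [[10.0, 20.0]]],
)
print(toy_out)
```

Only the audio position accumulates the injected features; text positions pass through unchanged, and the third layer receives no injection because only two feature sets are supplied.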
uv venv venv
source venv/bin/activate
# Install this fork (precompiled binaries)
git clone https://github.com/IS2AI/vLLM-Qolda-AVL.git
cd vLLM-Qolda-AVL
VLLM_USE_PRECOMPILED=1 uv pip install -e .

Serve the model (please adjust the flags to your setup):
vllm serve issai/Qolda-AVL-5B \
--served-model-name qolda-avl \
--trust-remote-code \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 16384 \
--limit-mm-per-prompt '{"audio": 1, "image": 1}'

This launches an OpenAI-compatible API on http://localhost:8000. The model accepts text, image, audio, and combined audio+image inputs via the standard chat/completions endpoint (with `input_audio` and `image_url` content parts).
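# Example client: send an audio clip and stream the model's response.
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)


def encode_audio_base64(path: str | Path) -> str:
    """Read an audio file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def encode_image_base64(path: str | Path) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
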
# Must match the --served-model-name passed to `vllm serve`.
MODEL_NAME = "qolda-avl"

audio_path = "sample_audio.wav"
audio_b64 = encode_audio_base64(audio_path)

stream = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_b64,
                        "format": "wav",
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Analyze the voice in the audio and identify the speaker's "
                        "gender (male or female). Also transcribe what is said. "
                        "Return your answer as JSON in the following format: "
                        '{"answer": "<male or female>", '
                        '"transcription": "<transcription>"}'
                    ),
                },
            ],
        }
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.8,
    stream=True,
    stream_options={"include_usage": True},  # request token usage in the stream
)

text = ""
usage = None
for chunk in stream:
    # With include_usage, the final chunk carries the usage and has no choices.
    if chunk.usage:
        usage = chunk.usage
    if chunk.choices and chunk.choices[0].delta.content:
        token = chunk.choices[0].delta.content
        print(token, end="", flush=True)
        text += token

Apache 2.0 (inherits from upstream vLLM). See LICENSE.