vLLM-Qolda-AVL

A minimal fork of vLLM with built-in support for Qolda-AVL (Audio-Vision-Language), which extends Qwen3-VL with an audio modality through a Whisper encoder, an MLP projector, and DeepStack injection at LLM layers.

The companion model checkpoint is available on Hugging Face: issai/Qolda-AVL-5B.

Changes on top of upstream vLLM

vllm/model_executor/models/qwen3_avl.py -- the Qwen3-AVL model implementation (audio encoder + projection + DeepStack + MRoPE filtering for audio tokens)
Registry entry in vllm/model_executor/models/registry.py registering Qwen3AVLForConditionalGeneration
librosa and soundfile added to runtime requirements (requirements/common.txt)

Install

uv venv venv
source venv/bin/activate

# Install this fork (precompiled binaries)
git clone https://github.com/IS2AI/vLLM-Qolda-AVL.git
cd vLLM-Qolda-AVL
VLLM_USE_PRECOMPILED=1 uv pip install -e .

Run the OpenAI-compatible server

Adjust the flags below to your setup (for example, the tensor-parallel size should match your GPU count):

vllm serve issai/Qolda-AVL-5B \
    --served-model-name qolda-avl \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --max-model-len 16384 \
    --limit-mm-per-prompt '{"audio": 1, "image": 1}'

This launches an OpenAI-compatible API on http://localhost:8000. The model accepts text, image, audio, and combined audio+image inputs via the standard chat/completions endpoint (with input_audio and image_url content parts).
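The combined audio+image case mentioned above can be sketched as a single user message whose content list carries one input_audio part, one image_url part, and a text part. This is a minimal sketch, not the repository's own code: the helper name build_audio_image_content and its parameters are hypothetical, and passing the image as a base64 data URL is an assumption about how you supply the image (a plain HTTP URL is another option).

```python
import base64


def build_audio_image_content(
    audio_bytes: bytes,
    image_bytes: bytes,
    prompt: str,
    audio_format: str = "wav",
    image_mime: str = "image/jpeg",
) -> list[dict]:
    """Build the `content` list for one user message (hypothetical helper)."""
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        # Audio part: base64 payload plus its container format.
        {
            "type": "input_audio",
            "input_audio": {"data": audio_b64, "format": audio_format},
        },
        # Image part: assumed here to be sent inline as a data URL.
        {
            "type": "image_url",
            "image_url": {"url": f"data:{image_mime};base64,{image_b64}"},
        },
        # Text part: the instruction that refers to both inputs.
        {"type": "text", "text": prompt},
    ]
```

The returned list would be passed as messages=[{"role": "user", "content": content}] in a chat.completions.create call. The one-audio, one-image shape matches the --limit-mm-per-prompt setting shown in the serve command.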

Inference example (Python OpenAI client)

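import base64
from pathlib import Path

from openai import OpenAI

# Must match the --served-model-name passed to `vllm serve`.
MODEL_NAME = "qolda-avl"

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # placeholder; the client requires a value
)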

def encode_audio_base64(path: str | Path) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def encode_image_base64(path: str | Path) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

audio_path = "sample_audio.wav"
audio_b64 = encode_audio_base64(audio_path)

stream = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_b64,
                        "format": "wav",
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Analyze the voice in the audio and identify the speaker's "
                        "gender (male or female). Also transcribe what is said. "
                        "Return your answer as JSON in the following format: "
                        '{"answer": "<male or female>",'
                        '"transcription": "<transcription>"}'
                    ),
                },
            ],
        }
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.8,
    stream=True,
    stream_options={"include_usage": True},
)

text = ""
usage = None
for chunk in stream:
    if chunk.usage:
        usage = chunk.usage
    if chunk.choices and chunk.choices[0].delta.content:
        token = chunk.choices[0].delta.content
        print(token, end="", flush=True)
        text += token

License

Apache 2.0 (inherits from upstream vLLM). See LICENSE.
