vLLM-Qolda-AVL

A minimal fork of vLLM with built-in support for Qolda-AVL (Audio-Vision-Language). Qolda-AVL extends Qwen3-VL with an audio modality: a Whisper encoder, an MLP projector, and DeepStack injection at the LLM layers.

The companion model checkpoint is available on Hugging Face: issai/Qolda-AVL-5B.

Changes on top of upstream vLLM

  • vllm/model_executor/models/qwen3_avl.py -- the Qwen3-AVL model implementation (audio encoder + projection + DeepStack + MRoPE filtering for audio tokens)
  • Registry entry in vllm/model_executor/models/registry.py registering Qwen3AVLForConditionalGeneration
  • librosa and soundfile added to runtime requirements (requirements/common.txt)

Install

uv venv venv
source venv/bin/activate

# Install this fork (precompiled binaries)
git clone https://github.com/IS2AI/vLLM-Qolda-AVL.git
cd vLLM-Qolda-AVL
VLLM_USE_PRECOMPILED=1 uv pip install -e .

Run the OpenAI-compatible server

Adjust the flags below to your setup:

vllm serve issai/Qolda-AVL-5B \
    --served-model-name qolda-avl \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --max-model-len 16384 \
    --limit-mm-per-prompt '{"audio": 1, "image": 1}'

This launches an OpenAI-compatible API on http://localhost:8000. The model accepts text, image, audio, and combined audio+image inputs via the standard chat/completions endpoint (with input_audio and image_url content parts).
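
As a rough illustration of the combined audio+image case, the sketch below builds the `content` list for a single user message. The helper name `build_audio_image_content`, its parameters, the default `image/png` MIME type, and the ordering of the parts are assumptions made for this example, not part of the fork. The `input_audio` and `image_url` part shapes follow the OpenAI chat format, with the image passed as a base64 data URL.

```python
import base64


def build_audio_image_content(audio_bytes, image_bytes, prompt,
                              audio_format="wav", image_mime="image/png"):
    """Build the `content` list for one user message (hypothetical helper).

    Combines one audio clip, one image, and a text prompt, matching the
    per-prompt limits of one audio and one image in the serve command above.
    """
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        # Audio is sent as raw base64 plus a format hint.
        {"type": "input_audio",
         "input_audio": {"data": audio_b64, "format": audio_format}},
        # Images are sent as a data URL inside an image_url part.
        {"type": "image_url",
         "image_url": {"url": f"data:{image_mime};base64,{image_b64}"}},
        {"type": "text", "text": prompt},
    ]
```

The returned list can be used as the `content` value of a `{"role": "user", ...}` message in a chat/completions request.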

Inference example (Python OpenAI client)

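import base64
from pathlib import Path

from openai import OpenAI

# Must match the --served-model-name passed to `vllm serve`.
MODEL_NAME = "qolda-avl"

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # placeholder value; the serve command above sets no API key
)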

def encode_audio_base64(path: str | Path) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def encode_image_base64(path: str | Path) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

audio_path = "sample_audio.wav"
audio_b64 = encode_audio_base64(audio_path)

stream = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_b64,
                        "format": "wav",
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Analyze the voice in the audio and identify the speaker's "
                        "gender (male or female). Also transcribe what is said. "
                        "Return your answer as JSON in the following format: "
                        '{"answer": "<male or female>",'
                        '"transcription": "<transcription>"}'
                    ),
                },
            ],
        }
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.8,
    stream=True,
    stream_options={"include_usage": True},
)

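# Print tokens as they arrive and accumulate the full reply.
text = ""
usage = None
for chunk in stream:
    # With include_usage, the final chunk carries usage and an empty choices list.
    if chunk.usage:
        usage = chunk.usage
    if chunk.choices and chunk.choices[0].delta.content:
        token = chunk.choices[0].delta.content
        print(token, end="", flush=True)
        text += token

print()
if usage:
    print(f"prompt tokens: {usage.prompt_tokens}, completion tokens: {usage.completion_tokens}")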
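
The prompt above asks for a JSON answer, so the accumulated `text` usually needs to be parsed afterwards. The following is a minimal sketch of one way to do that. The helper `parse_json_reply` is hypothetical, and it assumes the reply may wrap the JSON object in extra prose or a Markdown code fence, which is not guaranteed either way.

```python
import json


def parse_json_reply(text):
    """Return the JSON object embedded in `text`, or None if none parses.

    Hypothetical helper: takes the span from the first '{' to the last '}'
    so that surrounding prose or code fences are ignored.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        # The model produced something that looks like JSON but is not valid.
        return None
```

For example, `parse_json_reply(text)` on a well-formed reply would give a dict with the `answer` and `transcription` keys requested in the prompt; callers should still handle the `None` case.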

License

Apache 2.0 (inherits from upstream vLLM). See LICENSE.
