Welcome to the repo of Crab+! If our project helps you, please give us a ⭐ on GitHub to support us.
Crab+ is a scalable and unified audio-visual scene understanding model built upon Qwen2.5-Omni-7B with custom I-LoRA (Interaction-aware LoRA) fine-tuning. It addresses the negative transfer problem in multi-task audio-visual learning through explicit cooperation from both the data and model perspectives, achieving positive transfer across tasks.
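As a rough illustration of the fine-tuning setup, the sketch below attaches a plain PEFT LoRA adapter (r=128, alpha=256, matching the training arguments later in this README) to the Qwen2.5-Omni-7B backbone. The target modules are assumptions, and the repo's I-LoRA adds interaction-aware components beyond vanilla LoRA, so treat this only as a starting point:

```python
from transformers import Qwen2_5OmniForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the Qwen2.5-Omni-7B backbone.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)

# Plain LoRA config with the rank/alpha used by the training scripts;
# target_modules here are a common choice, not necessarily what I-LoRA uses.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```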
- [2026.03.04] Release training and evaluation codes of Crab+.
- [2026.03.04] Crab+ paper is available on arXiv.
Basic dependencies:
- Python == 3.10
- PyTorch == 2.5.1
- Transformers
- DeepSpeed
- PEFT
Install required packages:
git clone https://github.com/GeWu-Lab/Crab_Plus.git
cd Crab_Plus
conda create -n crab python=3.10 -y
conda activate crab
# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip install -r requirements.txt

# Install SAM2
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e ".[notebooks]"
cd ..
# Download SAM2 checkpoint
mkdir -p sam2/checkpoints
wget -P sam2/checkpoints/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt

The JSON annotation files are hosted on HuggingFace:
# Method 1: Using huggingface-cli
huggingface-cli download Jayson236/Crab_Plus AVUIE_2.zip --repo-type dataset --local-dir .
unzip AVUIE_2.zip
# Method 2: Using wget
wget https://huggingface.co/datasets/Jayson236/Crab_Plus/resolve/main/AVUIE_2.zip
unzip AVUIE_2.zip

After extraction, you should see the AVUIE_2/ directory containing JSON annotation files for all tasks.
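As a quick sanity check, you can count the samples in every task's annotation file with a small script like the one below (an illustrative sketch, assuming each file is a JSON array of samples; adapt it if a task stores a dict):

```python
import glob
import json

# Print the number of samples in each task's annotation file.
# Assumes each file is a JSON array; adjust if a task uses a different layout.
for path in sorted(glob.glob("AVUIE_2/*/*.json")):
    with open(path) as f:
        data = json.load(f)
    print(f"{path}: {len(data)} samples")
```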
Each dataset requires its original audio/video/image files. The expected directory structure:
- Video+Audio datasets: AVUIE_2/{task}/video/{filename} and AVUIE_2/{task}/audio/{filename}
- Image+Audio datasets (ms3, s4, ref_avs, arig): AVUIE_2/{task}/{relative_path} (paths stored directly in JSON)
AVUIE_2/
├── a2v/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── v2a/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── ks/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── ucf/
│   ├── train.json, test.json
│   ├── video/   # *.avi files (in subdirectories)
│   └── audio/   # *.wav files (in subdirectories)
├── meld/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files (in train/ or test/)
│   └── audio/   # *.mp3 files (in train/ or test/)
├── mer24/
│   ├── train.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── cremad/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── mafw/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── dfew/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── avqa/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── avqa_thu/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── ave/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── unav/
│   ├── train.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── avvp/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
└── avcap/
    ├── train.json
    ├── video/   # *.mp4 files
    └── audio/   # *.mp3 files
AVUIE_2/
├── s4/
│   ├── train.json, test.json
│   └── AVS/v1s/                 # Per-clip directories:
│       └── {clip_id}/
│           ├── audio.wav
│           ├── frames/          # 0.jpg, 1.jpg, ...
│           ├── labels_rgb/      # Ground-truth masks
│           └── labels_semantic/
├── ms3/
│   ├── train.json, test.json
│   └── AVS/v1m/                 # Same structure as s4
├── ref_avs/
│   ├── train.json, test.json
│   └── REFAVS/media/
│       └── {clip_id}/
│           ├── audio.wav
│           ├── frames/
│           └── gt_mask/
└── arig/
    ├── train.json, test.json
    └── AVS/v1s/                 # Shares media files with s4
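Before training, you can verify the media layout on your machine with a small script like the one below (an illustrative check based on the trees above; adjust AVUIE_ROOT and the task list to what you actually downloaded):

```python
import os

# Root of the extracted annotations/media (adjust to your setup).
AVUIE_ROOT = "AVUIE_2"

# Tasks that use the video/ + audio/ layout shown in the first tree above.
VIDEO_AUDIO_TASKS = [
    "a2v", "v2a", "ks", "ucf", "meld", "mer24", "cremad", "mafw",
    "dfew", "avqa", "avqa_thu", "ave", "unav", "avvp", "avcap",
]

for task in VIDEO_AUDIO_TASKS:
    for sub in ("video", "audio"):
        path = os.path.join(AVUIE_ROOT, task, sub)
        status = "ok" if os.path.isdir(path) else "MISSING"
        print(f"[{status:7}] {path}")
```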
Download Qwen2.5-Omni-7B from HuggingFace:
huggingface-cli download Qwen/Qwen2.5-Omni-7B --local-dir /path/to/Qwen2.5-Omni-7B

If using a local path, update QWEN_OMNI_PATH in the shell scripts.
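For example (assuming the scripts read the model location from the QWEN_OMNI_PATH variable mentioned above):

```bash
# In scripts/finetune/finetune_omni.sh and scripts/finetune/inference_omni.sh,
# point QWEN_OMNI_PATH at your local download:
QWEN_OMNI_PATH=/path/to/Qwen2.5-Omni-7B
```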
mkdir -p weight
# Method 1: Using huggingface-cli
huggingface-cli download Jayson236/Crab_Plus finetune_weights.bin --repo-type dataset --local-dir weight/
# Method 2: Using wget
wget -P weight/ https://huggingface.co/datasets/Jayson236/Crab_Plus/resolve/main/finetune_weights.bin

Edit scripts/finetune/finetune_omni.sh to configure paths, then run:
bash scripts/finetune/finetune_omni.sh

Key training arguments:
- NPROC_PER_NODE=2: Number of GPUs per node
- LOCAL_BATCH_SIZE=4: Per-GPU batch size
- --num_train_epochs 5: Number of epochs
- --lora_r 128, --lora_alpha 256: LoRA rank and scaling
- --deepspeed deepspeed/stage2.json: DeepSpeed ZeRO Stage 2
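A hedged sketch of how these arguments might be wired together inside the script; the training entry-point name (train.py) and flags such as --model_name_or_path are illustrative assumptions, and the authoritative command lives in scripts/finetune/finetune_omni.sh:

```bash
# Illustrative launch command; entry-point and some flag names are assumed.
NPROC_PER_NODE=2
LOCAL_BATCH_SIZE=4
QWEN_OMNI_PATH=/path/to/Qwen2.5-Omni-7B

torchrun --nproc_per_node ${NPROC_PER_NODE} train.py \
    --model_name_or_path ${QWEN_OMNI_PATH} \
    --per_device_train_batch_size ${LOCAL_BATCH_SIZE} \
    --num_train_epochs 5 \
    --lora_r 128 \
    --lora_alpha 256 \
    --deepspeed deepspeed/stage2.json
```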
Edit scripts/finetune/inference_omni.sh to configure paths, then run:
bash scripts/finetune/inference_omni.sh

For segmentation tasks (S4, MS3, Ref-AVS), a two-stage pipeline is used:
- Stage 1: Run Crab+ inference to generate predictions with bounding boxes / point coordinates
- Stage 2: Feed predictions into SAM2 for mask generation
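The scripts below handle Stage 2 end to end; for intuition, a minimal sketch of prompting SAM2 with a Stage-1 box might look like the following (using the standard SAM2 image-predictor API; the frame path and box values are placeholders, and the checkpoint/config paths follow the download step above, run from the repo root):

```python
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Build the SAM2 predictor from the downloaded checkpoint.
checkpoint = "sam2/checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Load one video frame and set it as the prediction target.
frame = np.array(Image.open("frame.jpg").convert("RGB"))
predictor.set_image(frame)

# xyxy bounding box predicted by Crab+ in Stage 1 (placeholder values).
box = np.array([100, 80, 320, 260])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)
```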
cd sam2
# S4 / MS3
bash ../seg/scripts/inference.sh
# Ref-AVS
bash ../seg/scripts/inference_ref.sh

This project is built upon the following open-source projects:
If you find Crab+ useful for your research and applications, please cite using this BibTeX:
@inproceedings{du2025crab,
  title={Crab: A unified audio-visual scene understanding model with explicit cooperation},
  author={Du, Henghui and Li, Guangyao and Zhou, Chang and Zhang, Chunjie and Zhao, Alan and Hu, Di},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18804--18814},
  year={2025}
}
@article{cai2026crab,
  title={Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation},
  author={Cai, Dongnuan and Du, Henghui and Zhou, Chang and Chen, Xi and Guo, Dan and Zhang, Hongyuan and Li, Xuelong and Hu, Di},
  journal={arXiv preprint arXiv:2603.04128},
  year={2026}
}

This project is released under the Apache 2.0 license as found in the LICENSE file.