Welcome to the repo of Crab+! If our project helps you, please give us a ⭐ on GitHub to support us.
Crab+ is a scalable and unified audio-visual scene understanding model built upon Qwen2.5-Omni-7B with custom I-LoRA (Interaction-aware LoRA) fine-tuning. It addresses the negative transfer problem in multi-task audio-visual learning through explicit cooperation from both the data and model perspectives, achieving positive transfer across tasks.
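As a rough illustration of the fine-tuning setup, the sketch below attaches a plain PEFT LoRA adapter (r=128, alpha=256, matching the training arguments later in this README) to the Qwen2.5-Omni-7B backbone. The target modules are assumptions, and the repo's I-LoRA adds interaction-aware components beyond vanilla LoRA, so treat this only as a starting point:

```python
from transformers import Qwen2_5OmniForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the Qwen2.5-Omni-7B backbone.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)

# Plain LoRA config with the rank/alpha used by the training scripts;
# target_modules here are a common choice, not necessarily what I-LoRA uses.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```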
- [2026.03.04] Release training and evaluation codes of Crab+.
- [2026.03.04] Crab+ paper is available on arXiv.
Basic dependencies:
- Python == 3.10
- PyTorch == 2.5.1
- Transformers
- DeepSpeed
- PEFT
Install required packages:
git clone https://github.com/GeWu-Lab/Crab_Plus.git
cd Crab_Plus
conda create -n crab python=3.10 -y
conda activate crab
# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip install -r requirements.txt

# Install SAM2
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e ".[notebooks]"
cd ..
# Download SAM2 checkpoint
mkdir -p sam2/checkpoints
wget -P sam2/checkpoints/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt

The JSON annotation files are hosted on HuggingFace:
# Method 1: Using huggingface-cli
huggingface-cli download Jayson236/Crab_Plus AVUIE_2.zip --repo-type dataset --local-dir .
unzip AVUIE_2.zip
# Method 2: Using wget
wget https://huggingface.co/datasets/Jayson236/Crab_Plus/resolve/main/AVUIE_2.zip
unzip AVUIE_2.zip

After extraction, you should see the AVUIE_2/ directory containing JSON annotation files for all tasks.
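As a quick sanity check, you can count the samples in every task's annotation file with a small script like the one below (an illustrative sketch, assuming each file is a JSON array of samples; adapt it if a task stores a dict):

```python
import glob
import json

# Print the number of samples in each task's annotation file.
# Assumes each file is a JSON array; adjust if a task uses a different layout.
for path in sorted(glob.glob("AVUIE_2/*/*.json")):
    with open(path) as f:
        data = json.load(f)
    print(f"{path}: {len(data)} samples")
```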
Each dataset requires its original audio/video/image files. The expected directory structure:
- Video+Audio datasets: AVUIE_2/{task}/video/{filename} and AVUIE_2/{task}/audio/{filename}
- Image+Audio datasets (ms3, s4, ref_avs, arig): AVUIE_2/{task}/{relative_path} (paths stored directly in JSON)
AVUIE_2/
├── a2v/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── v2a/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── ks/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── ucf/
│   ├── train.json, test.json
│   ├── video/   # *.avi files (in subdirectories)
│   └── audio/   # *.wav files (in subdirectories)
├── meld/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files (in train/ or test/)
│   └── audio/   # *.mp3 files (in train/ or test/)
├── mer24/
│   ├── train.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── cremad/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── mafw/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── dfew/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.wav files
├── avqa/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── avqa_thu/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── ave/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── unav/
│   ├── train.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
├── avvp/
│   ├── train.json, test.json
│   ├── video/   # *.mp4 files
│   └── audio/   # *.mp3 files
└── avcap/
    ├── train.json
    ├── video/   # *.mp4 files
    └── audio/   # *.mp3 files
AVUIE_2/
├── s4/
│   ├── train.json, test.json
│   └── AVS/v1s/                 # Per-clip directories:
│       └── {clip_id}/
│           ├── audio.wav
│           ├── frames/          # 0.jpg, 1.jpg, ...
│           ├── labels_rgb/      # Ground-truth masks
│           └── labels_semantic/
├── ms3/
│   ├── train.json, test.json
│   └── AVS/v1m/                 # Same structure as s4
├── ref_avs/
│   ├── train.json, test.json
│   └── REFAVS/media/
│       └── {clip_id}/
│           ├── audio.wav
│           ├── frames/
│           └── gt_mask/
└── arig/
    ├── train.json, test.json
    └── AVS/v1s/                 # Shares media files with s4
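Before training, you can verify the media layout on your machine with a small script like the one below (an illustrative check based on the trees above; adjust AVUIE_ROOT and the task list to what you actually downloaded):

```python
import os

# Root of the extracted annotations/media (adjust to your setup).
AVUIE_ROOT = "AVUIE_2"

# Tasks that use the video/ + audio/ layout shown in the first tree above.
VIDEO_AUDIO_TASKS = [
    "a2v", "v2a", "ks", "ucf", "meld", "mer24", "cremad", "mafw",
    "dfew", "avqa", "avqa_thu", "ave", "unav", "avvp", "avcap",
]

for task in VIDEO_AUDIO_TASKS:
    for sub in ("video", "audio"):
        path = os.path.join(AVUIE_ROOT, task, sub)
        status = "ok" if os.path.isdir(path) else "MISSING"
        print(f"[{status:7}] {path}")
```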
Download Qwen2.5-Omni-7B from HuggingFace:
huggingface-cli download Qwen/Qwen2.5-Omni-7B --local-dir /path/to/Qwen2.5-Omni-7B

If using a local path, update QWEN_OMNI_PATH in the shell scripts.
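For example (assuming the scripts read the model location from the QWEN_OMNI_PATH variable mentioned above):

```bash
# In scripts/finetune/finetune_omni.sh and scripts/finetune/inference_omni.sh,
# point QWEN_OMNI_PATH at your local download:
QWEN_OMNI_PATH=/path/to/Qwen2.5-Omni-7B
```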
mkdir -p weight
# Method 1: Using huggingface-cli
huggingface-cli download Jayson236/Crab_Plus finetune_weights.bin --repo-type dataset --local-dir weight/
# Method 2: Using wget
wget -P weight/ https://huggingface.co/datasets/Jayson236/Crab_Plus/resolve/main/finetune_weights.bin

Edit scripts/finetune/finetune_omni.sh to configure paths, then run:
bash scripts/finetune/finetune_omni.sh

Key training arguments:
- NPROC_PER_NODE=2: Number of GPUs per node
- LOCAL_BATCH_SIZE=4: Per-GPU batch size
- --num_train_epochs 5: Number of epochs
- --lora_r 128, --lora_alpha 256: LoRA rank and scaling
- --deepspeed deepspeed/stage2.json: DeepSpeed ZeRO Stage 2
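A hedged sketch of how these arguments might be wired together inside the script; the training entry-point name (train.py) and flags such as --model_name_or_path are illustrative assumptions, and the authoritative command lives in scripts/finetune/finetune_omni.sh:

```bash
# Illustrative launch command; entry-point and some flag names are assumed.
NPROC_PER_NODE=2
LOCAL_BATCH_SIZE=4
QWEN_OMNI_PATH=/path/to/Qwen2.5-Omni-7B

torchrun --nproc_per_node ${NPROC_PER_NODE} train.py \
    --model_name_or_path ${QWEN_OMNI_PATH} \
    --per_device_train_batch_size ${LOCAL_BATCH_SIZE} \
    --num_train_epochs 5 \
    --lora_r 128 \
    --lora_alpha 256 \
    --deepspeed deepspeed/stage2.json
```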
Edit scripts/finetune/inference_omni.sh to configure paths, then run:
bash scripts/finetune/inference_omni.sh

For segmentation tasks (S4, MS3, Ref-AVS), a two-stage pipeline is used:
- Stage 1: Run Crab+ inference to generate predictions with bounding boxes / point coordinates
- Stage 2: Feed predictions into SAM2 for mask generation
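The scripts below handle Stage 2 end to end; for intuition, a minimal sketch of prompting SAM2 with a Stage-1 box might look like the following (using the standard SAM2 image-predictor API; the frame path and box values are placeholders, and the checkpoint/config paths follow the download step above, run from the repo root):

```python
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Build the SAM2 predictor from the downloaded checkpoint.
checkpoint = "sam2/checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Load one video frame and set it as the prediction target.
frame = np.array(Image.open("frame.jpg").convert("RGB"))
predictor.set_image(frame)

# xyxy bounding box predicted by Crab+ in Stage 1 (placeholder values).
box = np.array([100, 80, 320, 260])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)
```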
cd sam2
# S4 / MS3
bash ../seg/scripts/inference.sh
# Ref-AVS
bash ../seg/scripts/inference_ref.sh

This project is built upon the following open-source projects:
If you find Crab+ useful for your research and applications, please cite using this BibTeX:
@inproceedings{du2025crab,
  title={Crab: A unified audio-visual scene understanding model with explicit cooperation},
  author={Du, Henghui and Li, Guangyao and Zhou, Chang and Zhang, Chunjie and Zhao, Alan and Hu, Di},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18804--18814},
  year={2025}
}
@article{cai2026crab,
  title={Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation},
  author={Cai, Dongnuan and Du, Henghui and Zhou, Chang and Chen, Xi and Guo, Dan and Zhang, Hongyuan and Li, Xuelong and Hu, Di},
  journal={arXiv preprint arXiv:2603.04128},
  year={2026}
}

This project is released under the Apache 2.0 license as found in the LICENSE file.