
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

If our project helps you, please give us a star ⭐ and cite our paper!


🌈 Introduction

This repo implements DualToken, a tokenizer that unifies the representations used for visual understanding and generation. Naively combining reconstruction and semantic objectives in a single tokenizer creates a conflict that degrades both reconstruction quality and semantic expressiveness. Instead of forcing one codebook to carry both perceptual and semantic information, DualToken disentangles them with separate codebooks for high-level and low-level features, turning their inherent conflict into a synergy. As a result, DualToken achieves state-of-the-art performance on both reconstruction and semantic tasks.
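The dual-vocabulary idea above can be illustrated with a minimal sketch: two separate codebooks, one quantizing high-level (semantic) features and one quantizing low-level (reconstruction) features. All sizes and dimensions below are made up for the example and are not DualToken's actual configuration.

```python
# Illustrative sketch only: nearest-neighbor quantization against two
# separate codebooks. Codebook sizes/dims are toy values, not DualToken's.
import numpy as np

rng = np.random.default_rng(0)

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry."""
    # (N, 1, D) - (1, K, D) -> (N, K) squared distances
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d2.argmin(axis=1)          # discrete token indices, shape (N,)
    return ids, codebook[ids]        # indices and quantized vectors

semantic_codebook = rng.normal(size=(1024, 64))   # high-level vocabulary
pixel_codebook    = rng.normal(size=(4096, 64))   # low-level vocabulary

feats = rng.normal(size=(16, 64))                 # toy encoder features
sem_ids, sem_q = quantize(feats, semantic_codebook)
pix_ids, pix_q = quantize(feats, pixel_codebook)
print(sem_ids.shape, sem_q.shape)  # (16,) (16, 64)
```

In the real model the two token streams come from different feature levels of the encoder rather than the same features, but the key point is the same: each objective gets its own vocabulary.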


Built upon DualToken, we construct a unified MLLM that performs strongly on downstream understanding and generation tasks. The code and weights of this unified MLLM will be released soon.


📰 News

  • [2026/04/21] 🌟 We have released the inference and training code of our tokenizer. More versions are on the way. Please stay tuned!
  • [2025/03/18] 🌟 We have released the technical report of DualToken. See here!

🔧 Requirements and Installation

  • Python ≥ 3.11
  • PyTorch ≥ 2.4.1
  • transformers == 4.44.0
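An environment matching these pins can be set up along the following lines (a sketch, not the repo's official setup script; adjust the PyTorch install to your CUDA version):

```shell
# Example environment setup matching the pins above.
conda create -n dualtoken python=3.11 -y
conda activate dualtoken
pip install "torch>=2.4.1" "transformers==4.44.0"
```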

🚀 Training

To train a tokenizer from scratch, run:

torchrun --nproc_per_node 8 -m main \
    --sem_weight 1 \
    --stage 1 \
    --name siglip-384-rvq8 \
    --model "model_config_siglip_384_rvq8" \
    --save-frequency 1 \
    --train-data="$YOUR_DATA_PATH/cc12/cc12m-train-{0000..2175}.tar" \
    --train-num-samples 10000000 \
    --dataset-type "webdataset" \
    --warmup=10000 \
    --batch-size=32 \
    --lr=7.2e-5 \
    --beta1=0.5 \
    --beta2=0.9 \
    --wd=0.0001 \
    --epochs=20 \
    --gan_start_epoch=0 \
    --restart_gan=20 \
    --workers=1

Alternatively, run the provided training script directly:

bash run.sh

Inference

python inference.py
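The config name `model_config_siglip_384_rvq8` suggests a residual vector quantizer of depth 8 (our reading of the name, not confirmed by the source). A minimal, self-contained sketch of residual VQ with toy sizes:

```python
# Hedged sketch of residual vector quantization (RVQ): each level
# quantizes the residual left by the previous one. Depth/sizes are
# illustrative only.
import numpy as np

rng = np.random.default_rng(1)

def rvq_encode(x, codebooks):
    """Quantize x in stages; each level quantizes the previous residual."""
    residual, ids = x.copy(), []
    for cb in codebooks:
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        i = d2.argmin(axis=1)
        ids.append(i)
        residual -= cb[i]            # pass the remainder to the next level
    return ids

def rvq_decode(ids, codebooks):
    """Reconstruct by summing the selected entry from every level."""
    return sum(cb[i] for cb, i in zip(codebooks, ids))

codebooks = [rng.normal(size=(64, 4)) for _ in range(8)]  # depth 8
x = rng.normal(size=(4, 4))
ids = rvq_encode(x, codebooks)
x_hat = rvq_decode(ids, codebooks)
err = float(((x - x_hat) ** 2).mean())
```

Each extra level refines the reconstruction, so deeper RVQ trades more tokens per image for lower reconstruction error.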

🙇 Acknowledgement

DualToken is built upon the awesome works VILA-U, OpenCLIP, and LLaVA.

📝 Citation

@article{song2025dualtoken,
  title={DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies},
  author={Song, Wei and Wang, Yuran and Song, Zijia and Li, Yadong and Sun, Haoze and Chen, Weipeng and Zhou, Zenan and Xu, Jianhua and Wang, Jiaqi and Yu, Kaicheng},
  journal={arXiv preprint arXiv:2503.14324},
  year={2025}
}

LICENSE

This project is licensed under the MIT License - see the LICENSE file for details.

About

[ICLR 2026] DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
