If our project helps you, please give us a star ⭐ and cite our paper!
This repo implements DualToken, a method that unifies representations for visual understanding and generation within a single tokenizer. Directly combining reconstruction and semantic objectives in one tokenizer creates a conflict that degrades both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high-level and low-level features, turning their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks.
Built upon DualToken, we construct a unified MLLM that demonstrates remarkable effectiveness on downstream understanding and generation tasks. The code and weights of our unified MLLM will be released soon.
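
To make the dual-vocabulary idea concrete, here is a minimal PyTorch sketch of quantizing the same patch features with two separate codebooks, one semantic and one perceptual. All module names and sizes are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn

def quantize(z: torch.Tensor, codebook: nn.Embedding):
    """Nearest-neighbor lookup with a straight-through gradient estimator."""
    dist = torch.cdist(z.flatten(0, 1), codebook.weight)   # (B*T, K)
    ids = dist.argmin(dim=-1).view(z.shape[:2])            # (B, T)
    z_q = codebook(ids)                                    # (B, T, D)
    return z + (z_q - z).detach(), ids                     # gradients bypass argmin

class DualCodebookTokenizer(nn.Module):
    """Illustrative only: high-level (semantic) and low-level (pixel) features
    each get their own codebook instead of sharing one vocabulary."""
    def __init__(self, dim: int = 64, vocab: int = 1024):
        super().__init__()
        self.sem_head = nn.Linear(dim, dim)   # stands in for deep ViT features
        self.pix_head = nn.Linear(dim, dim)   # stands in for shallow features
        self.sem_book = nn.Embedding(vocab, dim)
        self.pix_book = nn.Embedding(vocab, dim)

    def forward(self, feats: torch.Tensor):
        sem_q, sem_ids = quantize(self.sem_head(feats), self.sem_book)
        pix_q, pix_ids = quantize(self.pix_head(feats), self.pix_book)
        return (sem_q, sem_ids), (pix_q, pix_ids)

feats = torch.randn(2, 729, 64)  # e.g. 27x27 patch features from a 384px image
(_, sem_ids), (_, pix_ids) = DualCodebookTokenizer()(feats)
print(sem_ids.shape, pix_ids.shape)  # torch.Size([2, 729]) torch.Size([2, 729])
```

Because each objective updates its own codebook, the two losses no longer compete for a single vocabulary, which is the source of the conflict described above.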
- [2025/04/21] 🌟 We have released the inference and training code of our tokenizer. More versions are on the way. Please stay tuned!
- [2025/03/18] 🌟 We have released the technical report of DualToken. See [here](https://arxiv.org/abs/2503.14324)!
- Python ≥ 3.11
- PyTorch ≥ 2.4.1
- transformers == 4.44.0
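
If you want to verify your environment before training, here is a quick optional sanity check for the pinned versions above (our suggestion, not part of the repo):

```python
import sys
import torch
import transformers
from packaging.version import Version  # packaging ships as a transformers dependency

assert sys.version_info >= (3, 11), "need Python >= 3.11"
assert Version(torch.__version__) >= Version("2.4.1"), "need PyTorch >= 2.4.1"
assert transformers.__version__ == "4.44.0", "need transformers == 4.44.0"
print("environment OK")
```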
To train a tokenizer from scratch, run:

```bash
torchrun --nproc_per_node 8 -m main \
    --sem_weight 1 \
    --stage 1 \
    --name siglip-384-rvq8 \
    --model "model_config_siglip_384_rvq8" \
    --save-frequency 1 \
    --train-data="$YOUR_DATA_PATH/cc12/cc12m-train-{0000..2175}.tar" \
    --train-num-samples 10000000 \
    --dataset-type "webdataset" \
    --warmup=10000 \
    --batch-size=32 \
    --lr=7.2e-5 \
    --beta1=0.5 \
    --beta2=0.9 \
    --wd=0.0001 \
    --epochs=20 \
    --gan_start_epoch=0 \
    --restart_gan=20 \
    --workers=1
```
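
For intuition about the `--sem_weight` flag: the tokenizer is trained with both a reconstruction objective and a semantic objective, and `sem_weight` balances the two. The sketch below is a hypothetical combination; the actual recipe also includes quantization losses and, per the `--gan_start_epoch` flag, adversarial terms, which are omitted here:

```python
import torch
import torch.nn.functional as F

def tokenizer_loss(recon, image, sem_feat, teacher_feat, sem_weight=1.0):
    """Hypothetical sketch: a pixel-level term plus a semantic term,
    balanced by sem_weight (cf. --sem_weight above)."""
    recon_loss = F.mse_loss(recon, image)  # low-level fidelity
    # high-level alignment, e.g. against frozen teacher features (assumption)
    sem_loss = 1 - F.cosine_similarity(sem_feat, teacher_feat, dim=-1).mean()
    return recon_loss + sem_weight * sem_loss

image = torch.randn(2, 3, 384, 384)
loss = tokenizer_loss(torch.randn_like(image), image,
                      torch.randn(2, 729, 64), torch.randn(2, 729, 64))
print(loss.item())
```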
Or you can directly run the tokenizer training script:

```bash
bash run.sh
```

To run inference with a trained tokenizer:

```bash
python inference.py
```
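
For orientation, here is a toy sketch of the round trip a tokenizer like this performs: image to two token streams, then pixel tokens back to an image. `TinyTokenizer` and its methods are stand-ins invented for this example, not the API of `inference.py`:

```python
import torch
import torch.nn as nn

class TinyTokenizer(nn.Module):
    """Toy stand-in: encode an image into semantic and pixel token ids,
    then decode the pixel ids back into an image."""
    def __init__(self, dim=64, patch=16, vocab=1024):
        super().__init__()
        self.enc = nn.Conv2d(3, dim, patch, stride=patch)
        self.dec = nn.ConvTranspose2d(dim, 3, patch, stride=patch)
        self.sem_book = nn.Embedding(vocab, dim)
        self.pix_book = nn.Embedding(vocab, dim)

    def encode(self, img):
        z = self.enc(img).flatten(2).transpose(1, 2)     # (B, T, D)
        flat = z.flatten(0, 1)
        sem_ids = torch.cdist(flat, self.sem_book.weight).argmin(-1).view(z.shape[:2])
        pix_ids = torch.cdist(flat, self.pix_book.weight).argmin(-1).view(z.shape[:2])
        return sem_ids, pix_ids                          # two token streams

    def decode(self, pix_ids, hw):
        z = self.pix_book(pix_ids).transpose(1, 2)       # (B, D, T)
        return self.dec(z.view(z.size(0), -1, hw, hw))

tok = TinyTokenizer()
img = torch.randn(1, 3, 384, 384)
sem_ids, pix_ids = tok.encode(img)                       # 24x24 = 576 tokens each
print(tok.decode(pix_ids, hw=24).shape)                  # torch.Size([1, 3, 384, 384])
```

In a unified MLLM, the semantic stream would serve understanding while the pixel stream serves generation, per the framing in the introduction above.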
DualToken is built upon the awesome works VILA-U, OpenCLIP, and LLaVA.

If you find DualToken useful for your research, please cite:

```bibtex
@article{song2025dualtoken,
title={DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies},
author={Song, Wei and Wang, Yuran and Song, Zijia and Li, Yadong and Sun, Haoze and Chen, Weipeng and Zhou, Zenan and Xu, Jianhua and Wang, Jiaqi and Yu, Kaicheng},
journal={arXiv preprint arXiv:2503.14324},
year={2025}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.



