
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

If our project helps you, please give us a star ⭐ and cite our paper!


🌈 Introduction

This repo implements DualToken, a tokenizer that unifies the representations used for visual understanding and generation. Naively combining reconstruction and semantic objectives in a single tokenizer creates a conflict that degrades both reconstruction quality and semantic expressiveness. Instead of forcing one codebook to carry both perceptual and semantic information, DualToken disentangles them with separate codebooks for high-level and low-level features, turning their inherent conflict into a synergy. As a result, DualToken achieves state-of-the-art performance on both reconstruction and semantic tasks.
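The dual-vocabulary idea above can be illustrated with a minimal sketch: two separate codebooks, one quantizing high-level (semantic) features and one quantizing low-level (reconstruction) features. All sizes and dimensions below are made up for the example and are not DualToken's actual configuration.

```python
# Illustrative sketch only: nearest-neighbor quantization against two
# separate codebooks. Codebook sizes/dims are toy values, not DualToken's.
import numpy as np

rng = np.random.default_rng(0)

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry."""
    # (N, 1, D) - (1, K, D) -> (N, K) squared distances
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d2.argmin(axis=1)          # discrete token indices, shape (N,)
    return ids, codebook[ids]        # indices and quantized vectors

semantic_codebook = rng.normal(size=(1024, 64))   # high-level vocabulary
pixel_codebook    = rng.normal(size=(4096, 64))   # low-level vocabulary

feats = rng.normal(size=(16, 64))                 # toy encoder features
sem_ids, sem_q = quantize(feats, semantic_codebook)
pix_ids, pix_q = quantize(feats, pixel_codebook)
print(sem_ids.shape, sem_q.shape)  # (16,) (16, 64)
```

In the real model the two token streams come from different feature levels of the encoder rather than the same features, but the key point is the same: each objective gets its own vocabulary.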


Built upon DualToken, we construct a unified MLLM that performs strongly on downstream understanding and generation tasks. The code and weights of this unified MLLM will be released soon.


📰 News

  • [2026/04/21] 🌟 We have released the inference and training code of our tokenizer. More versions are on the way. Please stay tuned!
  • [2025/03/18] 🌟 We have released the technical report of DualToken. See here!

🔧 Requirements and Installation

  • Python ≥ 3.11
  • PyTorch ≥ 2.4.1
  • transformers == 4.44.0
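An environment matching these pins can be set up along the following lines (a sketch, not the repo's official setup script; adjust the PyTorch install to your CUDA version):

```shell
# Example environment setup matching the pins above.
conda create -n dualtoken python=3.11 -y
conda activate dualtoken
pip install "torch>=2.4.1" "transformers==4.44.0"
```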

🚀 Training

To train a tokenizer from scratch, run:

torchrun --nproc_per_node 8 -m main \
    --sem_weight 1 \
    --stage 1 \
    --name siglip-384-rvq8 \
    --model "model_config_siglip_384_rvq8" \
    --save-frequency 1 \
    --train-data="$YOUR_DATA_PATH/cc12/cc12m-train-{0000..2175}.tar" \
    --train-num-samples 10000000 \
    --dataset-type "webdataset" \
    --warmup=10000 \
    --batch-size=32 \
    --lr=7.2e-5 \
    --beta1=0.5 \
    --beta2=0.9 \
    --wd=0.0001 \
    --epochs=20 \
    --gan_start_epoch=0 \
    --restart_gan=20 \
    --workers=1

Alternatively, run the provided training script directly:

bash run.sh

Inference

python inference.py
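The config name `model_config_siglip_384_rvq8` suggests a residual vector quantizer of depth 8 (our reading of the name, not confirmed by the source). A minimal, self-contained sketch of residual VQ with toy sizes:

```python
# Hedged sketch of residual vector quantization (RVQ): each level
# quantizes the residual left by the previous one. Depth/sizes are
# illustrative only.
import numpy as np

rng = np.random.default_rng(1)

def rvq_encode(x, codebooks):
    """Quantize x in stages; each level quantizes the previous residual."""
    residual, ids = x.copy(), []
    for cb in codebooks:
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        i = d2.argmin(axis=1)
        ids.append(i)
        residual -= cb[i]            # pass the remainder to the next level
    return ids

def rvq_decode(ids, codebooks):
    """Reconstruct by summing the selected entry from every level."""
    return sum(cb[i] for cb, i in zip(codebooks, ids))

codebooks = [rng.normal(size=(64, 4)) for _ in range(8)]  # depth 8
x = rng.normal(size=(4, 4))
ids = rvq_encode(x, codebooks)
x_hat = rvq_decode(ids, codebooks)
err = float(((x - x_hat) ** 2).mean())
```

Each extra level refines the reconstruction, so deeper RVQ trades more tokens per image for lower reconstruction error.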

🙇 Acknowledgement

DualToken is built upon the awesome works VILA-U, OpenCLIP, and LLaVA.

📝 Citation

@article{song2025dualtoken,
  title={DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies},
  author={Song, Wei and Wang, Yuran and Song, Zijia and Li, Yadong and Sun, Haoze and Chen, Weipeng and Zhou, Zenan and Xu, Jianhua and Wang, Jiaqi and Yu, Kaicheng},
  journal={arXiv preprint arXiv:2503.14324},
  year={2025}
}

LICENSE

This project is licensed under the MIT License - see the LICENSE file for details.

About

[ICLR 2026] DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
