🎏 JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion
AAAI 2026 Accepted Paper
Haoyu Wang, Lei Zhang, Wenrui Liu, Dengyang Jiang, Wei Wei, Chen Ding
News • Installation • Data Prep • Getting Started • Training • Inference • Acknowledgements • Citation
JoDiffusion is a novel framework that promotes semantic segmentation by jointly diffusing images with their pixel-level annotations. By leveraging the power of diffusion models, JoDiffusion achieves state-of-the-art results on standard benchmarks.
Abstract: Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has recently garnered increasing attention for training high-performance semantic segmentation models. However, existing methods must either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problems, respectively. To mitigate both problems at once, we present a novel dataset-generating diffusion framework for semantic segmentation, termed JoDiffusion. First, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared with images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing so, JoDiffusion can simultaneously generate paired images and semantically consistent annotation masks conditioned solely on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on the Pascal VOC, COCO, and ADE20K datasets show that the annotated datasets generated by JoDiffusion yield substantial performance improvements in semantic segmentation compared to existing methods.
- [2025-12] 🔥 Code and pre-trained weights are released!
- [2025-11] 🎉 Our paper has been accepted by AAAI 2026!
We recommend using Anaconda to manage the environment.
```bash
conda create -n jodiffusion python=3.11 -y
conda activate jodiffusion
pip install -r requirements.txt
```
Please download the datasets and organize them following the MMSegmentation Dataset Preparation guide.
- ADE20K & Pascal VOC Aug: Follow the standard MMSegmentation structure.
- COCO: Download from the official website.
Convert COCO to a semantic segmentation format and generate captions for the datasets:
```bash
# Convert COCO to semantic segmentation format
python utils/prepare_coco_semantic_80.py

# Generate captions (using BLIP-2)
python utils/prepare_ade20k_blip2_captions.py
python utils/prepare_voc_blip2_captions.py
```

Update the `data_root` path in your dataset configuration file (e.g., `dataset/xxx_semantic.py`) to point to your local dataset directory.
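At its core, converting COCO to a semantic segmentation format means collapsing per-instance masks into a single per-pixel class-ID map. As a rough, hedged sketch of that idea (not the actual `prepare_coco_semantic_80.py` implementation; the function and argument names here are hypothetical):

```python
import numpy as np

def merge_instance_masks(instance_masks, category_ids, ignore_index=255):
    """Merge per-instance binary masks into one semantic mask.

    instance_masks: list of (H, W) boolean arrays, one per object instance.
    category_ids:   list of contiguous class IDs (e.g., 0..79 for COCO-80).
    Pixels covered by no instance are set to ignore_index.
    """
    h, w = instance_masks[0].shape
    semantic = np.full((h, w), ignore_index, dtype=np.uint8)
    # Paint larger instances first so smaller (typically foreground)
    # objects overwrite them wherever they overlap.
    order = np.argsort([-m.sum() for m in instance_masks])
    for i in order:
        semantic[instance_masks[i]] = category_ids[i]
    return semantic

# Example: a large object (class 5) overlapped by a smaller one (class 7).
bg = np.zeros((4, 4), dtype=bool); bg[:, :3] = True
fg = np.zeros((4, 4), dtype=bool); fg[1:3, 1:3] = True
sem = merge_instance_masks([bg, fg], [5, 7])
```

The largest-first painting order is one simple way to resolve overlaps; the repository's script may use a different rule.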
Before training, run the following script to adjust the pre-trained model weights to fit the label input dimensions:
```bash
python inference/jodiffusion.py
```
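The reason this adjustment is needed: a diffusion model that jointly denoises image and label latents must accept more input channels than the image-only pre-trained checkpoint provides. A minimal sketch of the channel-expansion idea, in NumPy for illustration (the shapes, function name, and zero-init choice are assumptions, not the repository's exact procedure):

```python
import numpy as np

def expand_conv_in(weight, extra_in_channels, scale=0.0):
    """Expand a conv weight of shape (out_c, in_c, kH, kW) along the
    input-channel axis so a pre-trained UNet can additionally take
    label-latent channels. New channels start at (near) zero so the
    initial forward pass matches the original image-only model.
    """
    out_c, in_c, kh, kw = weight.shape
    extra = scale * np.random.randn(out_c, extra_in_channels, kh, kw)
    return np.concatenate([weight, extra.astype(weight.dtype)], axis=1)

# Stable-Diffusion-style conv_in: 4 image-latent channels in, 320 out.
w = np.random.randn(320, 4, 3, 3).astype(np.float32)
w2 = expand_conv_in(w, 4)  # now accepts 4 image + 4 label latent channels
```

Zero-initializing the new channels is a common trick when grafting extra conditioning onto a pre-trained network, since it preserves the pre-trained behavior at step zero.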
Train the Variational Autoencoder (VAE) for label compression.
```bash
# Example: Training on ADE20K
bash scripts/ae/ade20k_light.sh
```

Note: You can modify `scripts/ae/xx_light.sh` to adjust hyperparameters such as batch size, learning rate, and epochs.

💾 Pre-trained Weights: We provide trained VAE weights for three datasets. Download them here.
Train the joint diffusion model using the pre-trained Label VAE.
```bash
bash scripts/ldm/ade20k_light_joint_5e5.sh
```

💾 Pre-trained Weights: Pre-trained JoDiffusion weights are also available here.
Generate synthetic images and masks for downstream task evaluation:
```bash
bash scripts/gen/ade20k_light.sh
```

Refine the generated masks using our optimization script:

```bash
bash scripts/optim/ade20k_light.sh
```

We use MMSegmentation to train downstream segmentation models (e.g., Mask2Former, DeepLabV3+) on the generated synthetic data. Please refer to the MMSegmentation docs for training commands.
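To give a feel for what mask optimization addresses: generated annotation masks often contain isolated mislabeled pixels. The snippet below is only an illustrative baseline (a per-pixel majority filter), not the algorithm implemented in `scripts/optim/`:

```python
import numpy as np

def mode_filter(mask, k=3, ignore_index=255):
    """Majority vote over a k x k window: a simple, illustrative way
    to suppress isolated label noise in a generated semantic mask."""
    h, w = mask.shape
    pad = k // 2
    padded = np.pad(mask, pad, constant_values=ignore_index)
    out = np.empty_like(mask)
    for y in range(h):
        for x in range(w):
            win = padded[y:y + k, x:x + k].ravel()
            win = win[win != ignore_index]  # padding does not vote
            vals, counts = np.unique(win, return_counts=True)
            out[y, x] = vals[np.argmax(counts)] if len(vals) else ignore_index
    return out

# A single speckle pixel (class 3) inside a class-1 region gets smoothed away.
noisy = np.ones((5, 5), dtype=np.uint8)
noisy[2, 2] = 3
clean = mode_filter(noisy)
```

A majority filter can erode thin structures, which is why dedicated mask-optimization strategies like the one in this repository go beyond simple local voting.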
This repository is built upon the amazing work of:
If you find this code useful for your research, please consider citing our paper:
```bibtex
@article{wang2025jodiffusion,
  title={JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion},
  author={Wang, Haoyu and Zhang, Lei and Liu, Wenrui and Jiang, Dengyang and Wei, Wei and Ding, Chen},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}
```