[TOIS 2024] Official implementation of DKMD, a dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems.
Xiaolin Chen¹, Xuemeng Song¹*, Liqiang Jing¹, Shuo Li¹, Linmei Hu², Liqiang Nie¹*

¹ Shandong University, Shandong, China

² Beijing Institute of Technology, Beijing, China

\* Corresponding authors
- Paper: ACM Digital Library
- Updates
- Introduction
- Highlights
- Method / Framework
- Project Structure
- Installation
- Usage
- Evaluation
- Citation
- Acknowledgement
- License
## Updates

- [10/2023] Paper accepted at ACM Transactions on Information Systems (TOIS)
- [10/2023] Released code and parameters
## Introduction

This repository is the official implementation of the paper "Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model", published in ACM Transactions on Information Systems (TOIS), 2024.
Text response generation for multimodal task-oriented dialog systems is an essential yet challenging task. Existing efforts still suffer from two pivotal limitations: (1) overlooking the benefit of generative pre-training, and (2) ignoring the textual context-related knowledge. To address these limitations, we propose DKMD (Dual Knowledge-enhanced generative pretrained language Model for multimodal task-oriented Dialog systems), where BART is adopted as the backbone. DKMD consists of three key components:
- Dual Knowledge Selection: Selects context-related knowledge from the knowledge base according to both textual and visual modalities of the given context.
- Dual Knowledge-enhanced Context Learning: Seamlessly integrates the selected knowledge into the multimodal context learning from both global and local perspectives, while exploring the cross-modal semantic relation via dual cross-modal representation refinement.
- Knowledge-enhanced Response Generation: Comprises a revised BART decoder with an additional dot-product knowledge-decoder attention (DKDA) sub-layer to explicitly use knowledge for precise text response generation.
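The knowledge-decoder attention in the third component can be pictured as standard scaled dot-product attention in which decoder hidden states query the selected knowledge representations. Below is a minimal NumPy sketch with made-up dimensions; the actual DKDA sub-layer sits inside the revised BART decoder and uses learned projections, which are omitted here:

```python
import numpy as np

def dkda(decoder_states, knowledge):
    """Illustrative dot-product knowledge-decoder attention.

    decoder_states: (T, d) decoder hidden states, acting as queries.
    knowledge:      (K, d) selected knowledge representations, acting as
                    keys and values (learned projection matrices omitted).
    Returns knowledge-attended states of shape (T, d).
    """
    d = decoder_states.shape[-1]
    scores = decoder_states @ knowledge.T / np.sqrt(d)      # (T, K)
    # softmax over the knowledge entries (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ knowledge                              # (T, d)

# Toy example: 3 decoder steps, 4 knowledge entries, hidden size 8.
rng = np.random.default_rng(0)
out = dkda(rng.normal(size=(3, 8)), rng.normal(size=(4, 8)))
print(out.shape)  # (3, 8)
```

Each output row is a convex combination of knowledge vectors, which is what lets the decoder explicitly condition every generation step on the selected knowledge.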
## Highlights

- Among the first to integrate generative pretrained language models (GPLMs) into multimodal task-oriented dialog systems
- Proposes dual knowledge selection to acquire context-related knowledge from both textual and visual modalities
- Designs dual cross-modal representation refinement (vision-oriented and text-oriented) to capture cross-modal semantic relations
- Devises a knowledge-enhanced BART decoder with dot-product knowledge-decoder attention for precise response generation
- Achieves state-of-the-art performance on a public multimodal task-oriented dialog benchmark
## Method / Framework

Figure 1. Overall framework of DKMD, which consists of three vital components: (a) Dual Knowledge Selection, (b) Dual Knowledge-enhanced Context Learning, and (c) Knowledge-enhanced Response Generation.
## Project Structure

```
.
├── asserts/        # Figures and framework diagrams
├── config/         # Configuration files
├── dataset/        # Dataset and data processing scripts
├── lib/            # Library dependencies
├── model/          # Model architecture definitions
├── target_file/    # Target files for evaluation
├── tools/          # Utility tools
├── util/           # Utility functions
├── constant.py     # Constants and hyperparameters
├── train.py        # Training script
├── train.sh        # Shell script for training
├── eval_2.sh       # Shell script for evaluation
├── README.md
└── ...
```
## Installation

Clone the repository:

```shell
git clone https://github.com/iLearn-Lab/DKMD.git
cd DKMD
```

Requirements:

- Python 3.8
- PyTorch 1.0
- NLTK 3.7
- transformers 4.3.2
## Usage

Run training with:

```shell
sh train.sh <gpu_id> text <model_file> <output_file>
```

## Evaluation

The Perl script mteval-v14.pl is used to evaluate the text results. First extract the results from the log files and convert them into an XML file; for convenience, convert.py is provided.
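The repository's convert.py performs the conversion; as a rough illustration of the kind of mteval test-set XML it has to produce, here is a hypothetical sketch (the function name, attribute values, and overall layout are assumptions based on the standard mteval `tstset` format, not taken from convert.py):

```python
# Hypothetical sketch of wrapping generated responses in mteval-style
# test-set XML; the repository's convert.py is the actual tool.
from xml.sax.saxutils import escape

def to_mteval_xml(lines, setid="dkmd", sysid="DKMD",
                  srclang="en", trglang="en"):
    segs = "\n".join(
        f'    <seg id="{i}">{escape(line)}</seg>'
        for i, line in enumerate(lines, start=1)
    )
    return (
        f'<tstset setid="{setid}" srclang="{srclang}" '
        f'trglang="{trglang}">\n'
        f'  <doc docid="dialog" sysid="{sysid}">\n'
        f"{segs}\n"
        "  </doc>\n"
        "</tstset>"
    )

print(to_mteval_xml(["Sure, here are some options.", "It costs $20."]))
```

Reference responses would be wrapped the same way in a `refset`, and both files passed to mteval-v14.pl for scoring.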
## Citation

If you find this work useful for your research, please cite our paper:

```bibtex
@article{chen2024dkmd,
  title={Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model},
  author={Chen, Xiaolin and Song, Xuemeng and Jing, Liqiang and Li, Shuo and Hu, Linmei and Nie, Liqiang},
  journal={ACM Transactions on Information Systems},
  volume={42},
  number={2},
  pages={1--28},
  year={2024},
  publisher={ACM}
}
```

## Acknowledgement

- Thanks to our collaborators for their valuable support.
- Thanks to the open-source community for providing useful baselines and tools.
## License

This project is released under the Apache License 2.0.
