Multimodal Dialogue Systems via Capturing Context-aware Dependencies of Semantic Elements

He, Wei; Li, Zhi; Lu, Dongcai; Chen, Enhong; Xu, Tong; Huai, Baoxing; Yuan, Jing

doi:10.1145/3394171.3413679

Cited by 24 publications

(16 citation statements)

References 29 publications


“…In addition, Nie et al [7] devised a multimodal dialog system with multiple decoders, which can generate diverse responses according to the user's intention and adaptively integrate the related knowledge. Recently, some studies have resorted to Transformer [21] to investigate the multimodal dialog systems [8], [9] due to its impressive results in natural language processing (NLP) tasks [10], [11], [12], [22], [23]. For example, He et al [8] introduced a Transformer-based element-level encoder, which can capture the semantic dependencies of multimodal elements (i.e., words and images) via the attention mechanism.…”

Section: Task-oriented Dialog Systems (mentioning)

confidence: 99%

“…Existing multimodal task-oriented dialog systems mainly adopt the encoder-decoder framework for text response generation. In particular, recent studies have recognized the pivotal role of the knowledge base for multimodal dialog systems, and designed various schemes for incorporating knowledge to enhance the user's intention understanding [2], [3], [4], [5], [6], [7], [8], [9]. Although they have achieved significant progress, these research efforts suffer from two key limitations.…”

Section: Introduction (mentioning)

confidence: 99%


Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model

Chen, Song, Jing, et al. 2022

Preprint

Text response generation for multimodal task-oriented dialog systems, which aims to generate the proper text response given the multimodal context, is an essential yet challenging task. Although existing efforts have achieved compelling success, they still suffer from two pivotal limitations: 1) overlook the benefit of generative pre-training, and 2) ignore the textual context related knowledge. To address these limitations, we propose a novel dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems (DKMD), consisting of three key components: dual knowledge selection, dual knowledge-enhanced context learning, and knowledge-enhanced response generation. To be specific, the dual knowledge selection component aims to select the related knowledge according to both textual and visual modalities of the given context. Thereafter, the dual knowledge-enhanced context learning component targets seamlessly integrating the selected knowledge into the multimodal context learning from both global and local perspectives, where the cross-modal semantic relation is also explored. Moreover, the knowledge-enhanced response generation component comprises a revised BART decoder, where an additional dot-product knowledge-decoder attention sub-layer is introduced for explicitly utilizing the knowledge to advance the text response generation. Extensive experiments on a public dataset verify the superiority of the proposed DKMD over state-of-the-art competitors.


“…It can incorporate different forms of domain knowledge for different intents through intention classification, and generate general responses, knowledge-aware responses, as well as multimodal responses through adaptive decoders. Moreover, combining with transformer [30], He et al [13] advanced a multimodal dialog system via capturing context-aware dependencies of semantic elements (MATE). This model uses relevant images and ordinal information in the dialog history to generate context-aware responses in the text response generation task.…”

Section: Multimodal Dialog Systems (mentioning)

confidence: 99%

Multimodal Dialog System: Relational Graph-based Context-aware Question Understanding

Zhang, Gao, et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia

Multimodal dialog system has attracted increasing attention from both academia and industry over recent years. Although existing methods have achieved some progress, they are still confronted with challenges in the aspect of question understanding (i.e., user intention comprehension). In this paper, we present a relational graph-based context-aware question understanding scheme, which enhances the user intention comprehension from local to global. Specifically, we first utilize multiple attribute matrices as the guidance information to fully exploit the product-related keywords from each textual sentence, strengthening the local representation of user intentions. Afterwards, we design a sparse graph attention network to adaptively aggregate effective context information for each utterance, completely understanding the user intentions from a global perspective. Moreover, extensive experiments over a benchmark dataset show the superiority of our model compared with several state-of-the-art baselines.

CCS Concepts: Computing methodologies → Discourse, dialogue and pragmatics.


“…In (Liao et al, 2018), a chat session is modeled as a reinforcement learning procedure, and a reward is formed to optimize the answer. He et al (2020) further consider the influence of the order of historical information images and text information on answers with a self-attention block. Comparatively, we unify the text generation and meme prediction into a long sequence procedure and solve them with a cross-modal GPT-based language model.…”

Section: A2 Technical Difference With Other Multimodal Dialogue Models (mentioning)

confidence: 99%

Towards Expressive Communication with Internet Memes: A New Multimodal Conversation Dataset and Benchmark

Fei, Li, Zhang, et al. 2021

Preprint

As a kind of new expression elements, Internet memes are popular and extensively used in online chatting scenarios since they manage to make dialogues vivid, moving, and interesting. However, most current dialogue researches focus on text-only dialogue tasks. In this paper, we propose a new task named as Meme incorporated Open-domain Dialogue (MOD). Compared to previous dialogue tasks, MOD is much more challenging since it requires the model to understand the multimodal elements as well as the emotions behind them. To facilitate the MOD research, we construct a large-scale open-domain multimodal dialogue dataset incorporating abundant Internet memes into utterances. The dataset consists of ∼45K Chinese conversations with ∼606K utterances. Each conversation contains about 13 utterances with about 4 Internet memes on average and each utterance equipped with an Internet meme is annotated with the corresponding emotion. In addition, we present a simple and effective method, which utilizes a unified generation network to solve the MOD task. Experimental results demonstrate that our method trained on the proposed corpus is able to achieve expressive communication including texts and memes. The corpus and models have been publicly available at https://github.com/lizekang/DSTC10-MOD.

