Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-short.113

Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images

Abstract: In multi-modal dialogue systems, it is important to allow the use of images as part of a multi-turn conversation. Training such dialogue systems generally requires a large-scale dataset consisting of multi-turn dialogues that involve images, but such datasets rarely exist. In response, this paper proposes a 45k multimodal dialogue dataset created with minimal human intervention. Our method to create such a dataset consists of (1) preparing and pre-processing text dialogue datasets, (2) creating image-mixed dia…
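The replacement step sketched in the abstract can be pictured as a retrieval problem: score each utterance against a pool of candidate images in a joint text-image embedding space, and swap in the best-matching image when the similarity is high enough. The sketch below illustrates that idea only; the encoder choice (a CLIP model served through the sentence-transformers library), the similarity threshold, and the replace_with_images helper are assumptions made for illustration, not the paper's actual pipeline.

```python
# Illustrative sketch (not the paper's pipeline): replace utterances with
# semantically relevant images via joint text-image embeddings.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # assumed encoder choice

def replace_with_images(utterances, image_paths, threshold=0.3):
    """Return a dialogue where sufficiently image-like turns become images.

    `threshold` is an illustrative value, not one reported by the paper.
    """
    img_emb = model.encode([Image.open(p) for p in image_paths])
    txt_emb = model.encode(utterances)
    sims = util.cos_sim(txt_emb, img_emb)  # shape: (num_turns, num_images)

    mixed = []
    for turn, row in zip(utterances, sims):
        score, idx = row.max(dim=0)
        if score.item() >= threshold:
            mixed.append({"image": image_paths[idx.item()]})  # replace text
        else:
            mixed.append({"text": turn})  # keep the original utterance
    return mixed
```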

Cited by 7 publications (3 citation statements) | References 12 publications
“…Existing interactive robots/agents using multimodal features have focused on question-answering from images [14], [15], request analysis [2], and conversations about images [16], [17], [18], [19]. Whether incorporating situation understanding results from multimodal cues significantly improves related tasks has been investigated to determine if we can clearly define things to be recognized for tasks.…”
Section: B. Using Multimodal Cues for Action Decisions
confidence: 99%
“…The topics of human utterances in a dialogue session are often triggered and grounded by these images, which is inconsistent with our daily communications, where the utterances are not always image-related. Secondly, other groups of datasets, such as OpenViDial 1.0/2.0 (Meng et al., 2020) and dialogues collected by Lee et al. (2021), are not originated from a real multi-modal conversation scenario. The former directly extracts dialogues and their visual contexts from movies and TV series, and the latter replaces some utterances with retrieved relevant images.…”
Section: Introduction
confidence: 99%
“…Then, other groups of works proposed to derive the images from the multi-turn conversations: Meng et al. (2020) constructed OpenViDial 1.0/2.0 by directly extracting dialogues and their visual contexts from movies and TV series. Lee et al. (2021) also built a multi-modal dialogue dataset by replacing the selected utterances with retrieved relevant images. However, although these corpora were constructed from open-domain conversations with images, they did not originate from a real multi-modal conversation scenario.…”
Section: Introduction
confidence: 99%