Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.38

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Abstract: Visual dialogue is a challenging task, since it requires answering a series of coherent questions based on an understanding of the visual environment. Previous studies explore multimodal coreference implicitly, by attending to spatial or object-level image features, but neglect the importance of explicitly locating the objects in the visual content that are associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer…
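The explicit grounding idea described in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' model; the function name (ground_entities), feature dimensions, and top_k selection are assumptions made for illustration. It only shows how textual entity embeddings could be matched against detected object features so that only the relevant regions are kept before response generation.

```python
# Hypothetical sketch of explicit visual grounding: score detected object
# features against textual entity embeddings and keep only the best-matching
# objects. Names and dimensions are illustrative, not the paper's implementation.
import torch
import torch.nn.functional as F


def ground_entities(entity_emb, object_feats, top_k=2):
    """Select the object features most similar to each textual entity.

    entity_emb:   (num_entities, dim) embeddings of entities in the question/history
    object_feats: (num_objects, dim)  region features from an object detector
    Returns the indices of the grounded objects and their features.
    """
    # Cosine similarity between every entity and every detected object.
    sim = F.cosine_similarity(entity_emb.unsqueeze(1),
                              object_feats.unsqueeze(0), dim=-1)
    # For each entity, keep the top-k matching objects (explicit grounding)
    # and discard the remaining visual content to reduce attention noise.
    topk = sim.topk(top_k, dim=1).indices          # (num_entities, top_k)
    grounded_idx = torch.unique(topk.flatten())
    grounded_feats = object_feats[grounded_idx]    # only the relevant regions
    return grounded_idx, grounded_feats


# Toy usage with random tensors standing in for encoder/detector outputs.
entities = torch.randn(3, 512)    # e.g. embeddings of "man", "dog", "frisbee"
objects = torch.randn(36, 512)    # e.g. 36 detected region features
idx, feats = ground_entities(entities, objects)
print(idx.shape, feats.shape)
```

The grounded features would then be fed, together with the dialogue context, to a decoder that generates the answer; that downstream step is omitted here.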

Cited by 14 publications (8 citation statements). References 29 publications.
“…Comparison with state-of-the-art. We compare GST with the state-of-the-art approaches on the validation set of the VisDial v1.0 and v0.9 datasets, consisting of UTC [23], MITVG [19], VD-BERT [22], LTMI [18], KBGN [17], DAM [16], ReDAN [12], DMRM [15], Primary [11], RvA [9], CorefNMN [8], CoAtt [7], HCIAE [5], and MN [1]. We decide to use the validation splits since all previous studies benchmarked the models on those splits.…”
Section: Quantitative Results and Analysis
confidence: 99%
“…Most of the previous approaches in VisDial [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20] have trained the dialog agents solely on VisDial data via supervised learning. More recent studies [21][22][23] have employed self-supervised pre-trained models such as BERT [24] or ViLBERT [25] and finetuned them on VisDial data.…”
Section: Introduction
confidence: 99%
“…Niu et al. [120]: selectively referring to the dialogue history to refine the visual attention until the answer is referenced. Chen et al. [11]: establishing a mapping between visual objects and textual entities to exclude undesired visual content.…”
Section: Visual Reference Resolution
confidence: 99%
“…The above works all implicitly attend to spatial or object-level image features and are thus inevitably distracted by unnecessary visual content. To address this, Chen et al. [11] establish an explicit mapping between objects in the image and textual entities in the input query and dialogue history, to exclude undesired visual content and reduce attention noise. Additionally, the multimodal incremental transformer integrates visual information and dialogue context to generate visually and contextually coherent responses.…”
Section: Unique Training Schemes-based VAD
confidence: 99%
“…Recently, with the rise of pre-trained models [2], researchers have begun to explore vision-and-language tasks [3,4,5] with pre-trained models [1]. Specifically, visual dialog [6,7,8,9], which aims to hold a meaningful conversation with a human about a given image, is a challenging task that requires models to have sufficient cross-modal understanding of both visual and textual context to answer the current question.…”
Section: Introduction
confidence: 99%