2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00265

Making History Matter: History-Advantage Sequence Training for Visual Dialog

Abstract: We study the multi-round response generation in visual dialog, where a response is generated according to a visually grounded conversational history. Given a triplet: an image, Q&A history, and current question, all the prevailing methods follow a codec (i.e., encoder-decoder) fashion in a supervised learning paradigm: a multimodal encoder encodes the triplet into a feature vector, which is then fed into the decoder for the current answer generation, supervised by the ground-truth. However, this conventional s…
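The abstract's codec pipeline (encode the image/history/question triplet, then decode an answer supervised by the ground truth) can be illustrated with a minimal sketch. This is not the paper's architecture: the module names (TripletEncoder, AnswerDecoder), dimensions, and late-fusion strategy are all assumptions for illustration only.

```python
# Minimal sketch of the encoder-decoder ("codec") pipeline described in the
# abstract. Module names, dimensions, and the fusion scheme are assumptions.
import torch
import torch.nn as nn

class TripletEncoder(nn.Module):
    """Fuses image features, Q&A history, and the current question into one vector."""
    def __init__(self, img_dim=2048, emb_dim=300, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.q_rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.h_rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.img_fc = nn.Linear(img_dim, hid_dim)
        self.fuse = nn.Linear(3 * hid_dim, hid_dim)

    def forward(self, img_feat, history_tokens, question_tokens):
        _, (q_vec, _) = self.q_rnn(self.embed(question_tokens))   # (1, B, H)
        _, (h_vec, _) = self.h_rnn(self.embed(history_tokens))    # (1, B, H)
        i_vec = torch.tanh(self.img_fc(img_feat))                 # (B, H)
        joint = torch.cat([i_vec, h_vec[-1], q_vec[-1]], dim=-1)
        return torch.tanh(self.fuse(joint))                       # fused feature vector

class AnswerDecoder(nn.Module):
    """Generates the answer token-by-token from the fused feature vector."""
    def __init__(self, hid_dim=512, emb_dim=300, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, context, answer_in):
        h0 = context.unsqueeze(0)            # seed the decoder with the fused vector
        c0 = torch.zeros_like(h0)
        out, _ = self.rnn(self.embed(answer_in), (h0, c0))
        return self.out(out)                 # (B, T, vocab) logits

# Supervised training step against the ground-truth answer (teacher forcing),
# i.e. the conventional paradigm the abstract refers to.
encoder, decoder = TripletEncoder(), AnswerDecoder()
img = torch.randn(2, 2048)
history = torch.randint(0, 10000, (2, 40))
question = torch.randint(0, 10000, (2, 12))
answer_in = torch.randint(0, 10000, (2, 8))
answer_gt = torch.randint(0, 10000, (2, 8))

logits = decoder(encoder(img, history, question), answer_in)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), answer_gt.reshape(-1))
loss.backward()
```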

Citations: cited by 63 publications (37 citation statements). References: 32 publications.
“…Das et al [2] and Lu et al [7] adopted a dialog history attention mechanism to find and focus on past dialog history related to the present question. Wu et al [10], Guo et al [4], and Yang et al [11] proposed a model for applying the co-attention mechanism among the three elements of current question, image, and past dialog history to determine the answer to the current question. Gan et al [3] proposed a model that repeats co-attention among the three elements several times.…”
Section: Related Work
confidence: 99%
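The co-attention idea referenced in the excerpt above can be sketched generically: the current question guides attention over both image regions and past dialog rounds, and the attended summaries are fused. The scaled dot-product form and the question-as-query choice below are assumptions, not the formulation of any specific cited model.

```python
# Generic sketch of question-guided co-attention over image regions and
# dialog history. The attention form is an illustrative assumption.
import torch
import torch.nn.functional as F

def attend(query, keys):
    """Weight `keys` (B, N, D) by similarity to `query` (B, D); return (B, D)."""
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1) / keys.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)                      # (B, N)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)  # attended summary

def co_attention(question_vec, image_regions, history_facts):
    """One round of co-attention; models like Gan et al. iterate such rounds."""
    attended_img = attend(question_vec, image_regions)    # focus on relevant regions
    attended_hist = attend(question_vec, history_facts)   # focus on relevant QA rounds
    return torch.cat([question_vec, attended_img, attended_hist], dim=-1)

# Toy shapes: batch of 2, 36 image regions, 10 history rounds, feature size 512.
q = torch.randn(2, 512)
v = torch.randn(2, 36, 512)
h = torch.randn(2, 10, 512)
fused = co_attention(q, v, h)   # (2, 1536), fed to an answer decoder downstream
```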
“…The existing models for visual dialog have been mostly implemented with a large monolithic neural network [3,4,5,6,7,8,9,10,11]. However, VQA and visual dialog are composable in nature in that the process of generating an answer to one natural language question can be completed by composing multiple basic neural network modules.…”
Section: Introduction
confidence: 99%
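The "composable" view in the excerpt above is the neural-module-network idea: an answer is produced by chaining small reusable modules rather than one monolithic network. The sketch below uses hypothetical module names (FindModule, AnswerModule) and a hard-coded two-step layout purely as an illustration.

```python
# Illustrative sketch of composing basic neural modules to answer a question.
# Module names and the fixed Find -> Answer layout are hypothetical.
import torch
import torch.nn as nn

class FindModule(nn.Module):
    """Scores image regions against a textual query vector."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions, query):            # regions: (N, D), query: (D,)
        return torch.softmax(regions @ self.proj(query), dim=0)  # attention over regions

class AnswerModule(nn.Module):
    """Maps an attended visual summary to answer logits."""
    def __init__(self, dim=512, num_answers=100):
        super().__init__()
        self.cls = nn.Linear(dim, num_answers)

    def forward(self, regions, attention):
        summary = (attention.unsqueeze(-1) * regions).sum(dim=0)  # weighted pooling
        return self.cls(summary)

# Compose the modules for one question: Find -> Answer.
regions = torch.randn(36, 512)   # toy image region features
query = torch.randn(512)         # toy encoding of the current question
find, answer = FindModule(), AnswerModule()
logits = answer(regions, find(regions, query))   # (100,) answer scores
```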
“…We also include some concurrent work for visual dialog that has not been discussed above, including image-question-answer synergistic network (Guo et al, 2019), recursive visual attention (Niu et al, 2018), factor graph attention (Schwartz et al, 2019), dual attention network (Kang et al, 2019), graph neural network, history-advantage sequence training (Yang et al, 2019), and weighted likelihood estimation.…”
Section: Concurrent Work
confidence: 99%
“…The Guesser model is evaluated by classification error rate. Two baseline models [6] (HRED, HRED-VGG), three attention-based models (PLAN [28], A-ATT [7], HACAN [25]), and two Feature-wise Linear Modulation (FiLM) models (single-hop FiLM [14], multi-hop FiLM [23]) are compared. Table 3 compares the test error of Guesser models.…”
Section: Evaluation Metric and Comparison Models
confidence: 99%
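For reference, the classification error rate mentioned in the excerpt above is simply the fraction of dialogs where the guessed object is not the target. A minimal sketch, with assumed tensor shapes:

```python
# Classification error rate for a Guesser-style model: fraction of wrong guesses.
import torch

def error_rate(logits, targets):
    """logits: (B, num_objects) guess scores; targets: (B,) ground-truth indices."""
    return (logits.argmax(dim=-1) != targets).float().mean().item()

print(error_rate(torch.randn(8, 20), torch.randint(0, 20, (8,))))
```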