ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682583

End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features

Abstract: Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; a…
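The abstract describes fusing learned features from several modalities with attention. As a minimal illustrative sketch (not the paper's actual model), dot-product attention over per-modality feature vectors, guided by a question vector, can be written as follows; the function names and the fixed-dimensional toy vectors are assumptions for illustration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_fuse(query, modality_feats):
    """Score each modality feature against the query (dot product),
    softmax the scores, and return the attention-weighted sum.

    query          -- question feature vector, length d
    modality_feats -- list of modality feature vectors, each length d
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in modality_feats]
    weights = softmax(scores)
    d = len(query)
    fused = [sum(w * feat[i] for w, feat in zip(weights, modality_feats))
             for i in range(d)]
    return fused, weights

# Toy usage: a query aligned with the first "modality" gets the larger weight.
fused, weights = attention_fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Real systems compute these features with trained encoders (e.g. video and audio networks) and learn the attention parameters end-to-end; this sketch only shows the weighting-and-summing step.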

Cited by 101 publications (108 citation statements)
References 23 publications
“…For VQA v1.0 we increase validation set accuracy from 57.0 to 57.3 (no tuning) by replacing the alternating and parallel attention [33]. For AVSD, we improve Hori et al. [19], which reports a CIDEr score of 0.733, to 0.806. We used FGA to attend to all video cues as well as the question.…”
Section: Quantitative Evaluationmentioning
confidence: 99%
“…We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that permit it to outperform the current state-of-the-art by more than 20% on CIDEr. Recent work on audio-visual scene-aware dialog [2, 25] partly addresses this shortcoming and proposes a novel …” (Figure example from the citing paper: Question: “what color is the rag?” Answer: “it appears to be white.”)
mentioning
confidence: 99%
“…This suggests the effectiveness of our proposed question-guided video representations for VideoQA. Table 2 compares with existing approaches on the AVSD public test set: Naïve Fusion (Alamri et al., 2019b; Zhuang et al., 2019), Attentional Fusion (Hori et al., 2018; Zhuang et al., 2019), Multi-Source Sequence-to-Sequence model (Pasunuru and Bansal, 2019), Modified Attentional Fusion with a Maximum Mutual Information objective (Zhuang et al., 2019), and Hierarchical Attention with pre-trained embeddings (Le et al., 2019). For each approach, we report its corpus-wide scores on BLEU-1 through BLEU-4, METEOR, ROUGE-L and CIDEr.…”
Section: Comparison With Existing Methodsmentioning
confidence: 99%
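The citation statements above compare systems on n-gram overlap metrics (BLEU, METEOR, ROUGE-L, CIDEr). As a hedged sketch of the simplest of these (not the evaluation code any of these papers used), BLEU-1 is clipped unigram precision times a brevity penalty:

```python
from collections import Counter
import math

def bleu1(reference, hypothesis):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty.

    reference, hypothesis -- whitespace-tokenized strings.
    Real evaluations use corpus-level BLEU with multiple references and
    higher-order n-grams; this is a minimal single-reference illustration.
    """
    ref, hyp = reference.split(), hypothesis.split()
    ref_counts = Counter(ref)
    hyp_counts = Counter(hyp)
    # Clip each hypothesis unigram count by its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    precision = clipped / len(hyp) if hyp else 0.0
    # Penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * precision
```

CIDEr, the metric most of the quoted improvements are reported on, additionally TF-IDF-weights the n-grams across the corpus, which is why corpus-wide scores are reported rather than per-sentence ones.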