ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682583

End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features

Abstract: Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; a…
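The abstract describes fusing learned features from several modalities with attention. As a minimal illustrative sketch (not the paper's actual model), dot-product attention over per-modality feature vectors, guided by a question vector, can be written as follows; the function names and the fixed-dimensional toy vectors are assumptions for illustration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_fuse(query, modality_feats):
    """Score each modality feature against the query (dot product),
    softmax the scores, and return the attention-weighted sum.

    query          -- question feature vector, length d
    modality_feats -- list of modality feature vectors, each length d
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in modality_feats]
    weights = softmax(scores)
    d = len(query)
    fused = [sum(w * feat[i] for w, feat in zip(weights, modality_feats))
             for i in range(d)]
    return fused, weights

# Toy usage: a query aligned with the first "modality" gets the larger weight.
fused, weights = attention_fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Real systems compute these features with trained encoders (e.g. video and audio networks) and learn the attention parameters end-to-end; this sketch only shows the weighting-and-summing step.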

Cited by 101 publications (108 citation statements)
References 23 publications
“…For VQA v1.0 we increase validation set accuracy from 57.0 to 57.3 (no tuning) by replacing the alternating and parallel attention [33]. For AVSD, we improve Hori et al. [19], which reports a CIDEr score of 0.733, to 0.806. We used FGA to attend to all video cues as well as the question.…”
Section: Quantitative Evaluationmentioning
confidence: 99%
“…We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that permit it to outperform the current state-of-the-art by more than 20% on CIDEr. Recent work on audio-visual scene-aware dialog [2, 25] partly addresses this shortcoming and proposes a novel …” (Figure example from the citing paper: Question: “what color is the rag?” Answer: “it appears to be white.”)
mentioning
confidence: 99%
“…This suggests the effectiveness of our proposed question-guided video representations for VideoQA. Table 2 compares with existing approaches on the AVSD public test set: Naïve Fusion (Alamri et al., 2019b; Zhuang et al., 2019), Attentional Fusion (Hori et al., 2018; Zhuang et al., 2019), Multi-Source Sequence-to-Sequence model (Pasunuru and Bansal, 2019), Modified Attentional Fusion with a Maximum Mutual Information objective (Zhuang et al., 2019), and Hierarchical Attention with pre-trained embeddings (Le et al., 2019). For each approach, we report its corpus-wide scores on BLEU-1 through BLEU-4, METEOR, ROUGE-L and CIDEr.…”
Section: Comparison With Existing Methodsmentioning
confidence: 99%
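The citation statements above compare systems on n-gram overlap metrics (BLEU, METEOR, ROUGE-L, CIDEr). As a hedged sketch of the simplest of these (not the evaluation code any of these papers used), BLEU-1 is clipped unigram precision times a brevity penalty:

```python
from collections import Counter
import math

def bleu1(reference, hypothesis):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty.

    reference, hypothesis -- whitespace-tokenized strings.
    Real evaluations use corpus-level BLEU with multiple references and
    higher-order n-grams; this is a minimal single-reference illustration.
    """
    ref, hyp = reference.split(), hypothesis.split()
    ref_counts = Counter(ref)
    hyp_counts = Counter(hyp)
    # Clip each hypothesis unigram count by its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    precision = clipped / len(hyp) if hyp else 0.0
    # Penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * precision
```

CIDEr, the metric most of the quoted improvements are reported on, additionally TF-IDF-weights the n-grams across the corpus, which is why corpus-wide scores are reported rather than per-sentence ones.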