Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence 2018
DOI: 10.24963/ijcai.2018/513

Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network

Abstract: Conversational video question answering is a challenging task in visual information retrieval, which generates the accurate answer from the referenced video contents according to the visual conversation context and given question. However, the existing visual question answering methods mainly tackle the problem of single-turn video question answering, and may be ineffective when applied directly to multi-turn video question answering, because they do not sufficiently model the sequential conversation context. In …

Cited by 38 publications (15 citation statements)
References 6 publications
“…The existing visual dialog methods (Das et al, 2017a) mainly use recurrent neural network (like LSTM) to encode the dialog history as a single vector representation, which we think might be a bit rough and straightforward. Some more advanced methods (Seo et al, 2017;Zhao et al, 2018) utilize hierarchical structure, attention and memory mechanisms to refine the dialog history representation, which still lacks an explicit reasoning process. Recently, Kottur et al (2018) propose a neural module network architecture including two novel modules, which perform coreference resolution at a word level.…”
Section: What Does the Person Do With It? (mentioning)
confidence: 99%
“…As for the task of video dialog, it's still less explored. One similar work is proposed by Zhao et al (2018). They study the problem of multi-turn video question answering by employing a hierarchical attention context learning method with recurrent neural networks for context-aware question understanding and a multi-stream attention network that learns the joint video representation.…”
Section: Related Work (mentioning)
confidence: 99%
“…Indeed, we observe that there are more and more VQA datasets (Jang et al 2017;Zhu et al 2016;Ranjay et al 2016) containing multiple questions for a video sequence. This is also a common phenomenon in applications such as education for children (Calhoun 1999) and multi-turn video question answering (Zhao et al 2018). It is worth noting that the semantic relational information among multiple video-question pairs plays an important role for human performing actually reasoning in VQA tasks.…”
Section: Introduction (mentioning)
confidence: 96%