Proceedings of the 25th ACM International Conference on Multimedia 2017
DOI: 10.1145/3123266.3123364
Video Question Answering via Hierarchical Dual-Level Attention Network Learning

Cited by 35 publications (5 citation statements)
References 38 publications
“…One of the first datasets for VQA is YouTube2Text [48], which comprises 1987 videos and 122,708 natural-language descriptions of them. Many existing models [5,9,13,21] address the task of VQA with the help of YouTube2Text [48]. The LSMDC 2016 description dataset [47] enables multi-sentence description of the videos, a property that distinguishes it from earlier datasets.…”
Section: Datasets
confidence: 99%
“…In [9], the authors study the problem of VQA from the viewpoint of hierarchical Dual-Level Attention Network (DLAN) learning. Object appearance and movement information is extracted from the video on the basis of frame-level and segment-level feature representations.…”
Section: DLAN (Zhao et al. 2017)
confidence: 99%
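A minimal sketch of what question-guided attention over frame-level and segment-level features could look like, assuming PyTorch and precomputed feature tensors; the scoring function and concatenation-based fusion here are illustrative stand-ins, not the exact DLAN architecture of [9]:

```python
import torch
import torch.nn as nn

class DualLevelAttention(nn.Module):
    """Question-guided attention applied separately at the frame level
    (appearance) and the segment level (motion), then fused by concatenation."""
    def __init__(self, dim: int):
        super().__init__()
        # Hypothetical scorers: a linear layer over [feature; question].
        self.frame_score = nn.Linear(2 * dim, 1)
        self.seg_score = nn.Linear(2 * dim, 1)

    @staticmethod
    def attend(feats: torch.Tensor, q: torch.Tensor, scorer: nn.Linear) -> torch.Tensor:
        # feats: (batch, n, dim); q: (batch, dim)
        q_exp = q.unsqueeze(1).expand(-1, feats.size(1), -1)
        w = torch.softmax(scorer(torch.cat([feats, q_exp], dim=-1)), dim=1)
        return (w * feats).sum(dim=1)  # attention-weighted sum: (batch, dim)

    def forward(self, frame_feats, seg_feats, q):
        appearance = self.attend(frame_feats, q, self.frame_score)  # frame level
        motion = self.attend(seg_feats, q, self.seg_score)          # segment level
        return torch.cat([appearance, motion], dim=-1)              # fused video code
```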
“…Motivated by the success of attention mechanisms [53]-[55], self-attentive pooling is adopted to explicitly capture the varying importance of each sequential unit by assigning it an attention weight. For each modality, the self-attentive pooling takes the whole output sequence…”
Section: Self-Attentive Sequential Feature Learning
confidence: 99%
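A minimal sketch of self-attentive pooling in PyTorch, assuming an encoder's output sequence of hidden states; the hidden and attention dimensions are illustrative, and this is the common tanh-scoring variant rather than necessarily the cited papers' exact formulation:

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Pools a feature sequence into a single vector by learning
    a scalar attention weight for each sequential unit."""
    def __init__(self, dim: int, attn_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(dim, attn_dim)             # project hidden states
        self.score = nn.Linear(attn_dim, 1, bias=False)  # scalar score per step

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, dim), e.g. the whole output sequence of an encoder
        e = self.score(torch.tanh(self.proj(h)))  # (batch, seq_len, 1)
        a = torch.softmax(e, dim=1)                # attention weight per unit
        return (a * h).sum(dim=1)                  # weighted sum: (batch, dim)

# Usage: pool 20 per-frame features of one modality into one vector.
pool = SelfAttentivePooling(dim=512)
frames = torch.randn(2, 20, 512)
pooled = pool(frames)  # shape: (2, 512)
```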
“…), which aims to answer a natural language question according to a video clip, is an important task in multimedia understanding. Modern Video QA systems [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13] require sufficient video-question-answer triples for training. As shown in Figure 1, many Video QA datasets are labeled with caption question generation (CapQG), where question-answer pairs are created by a text question generation system from the captions of a video.…”
Section: Introduction
confidence: 99%
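An illustrative sketch of CapQG-style labeling, using a toy rule-based generator in place of the trained text question-generation model a real pipeline would use; `generate_qa` and `label_video_qa` are hypothetical names for illustration:

```python
from typing import List, Tuple

def generate_qa(caption: str) -> Tuple[str, str]:
    """Hypothetical toy QG rule: split a caption at the first ' is ';
    the subject becomes the answer, the predicate a 'who' question.
    A real CapQG pipeline would use a learned question-generation model."""
    subject, sep, predicate = caption.partition(" is ")
    if not sep:  # no auxiliary found; skip this caption
        return ("", "")
    return (f"Who is {predicate}?", subject)

def label_video_qa(video_id: str, captions: List[str]) -> List[Tuple[str, str, str]]:
    """Build video-question-answer triples from a video's captions."""
    triples = []
    for c in captions:
        q, a = generate_qa(c)
        if q:
            triples.append((video_id, q, a))
    return triples

print(label_video_qa("vid_001", ["a man is slicing a potato"]))
# [('vid_001', 'Who is slicing a potato?', 'a man')]
```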