2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv45572.2020.9093596

BERT Representations for Video Question Answering

Cited by 92 publications (49 citation statements)
References 30 publications
“…Anderson et al. [1] exploit object-level attention via bottom-up attention, then associate the output sequences with salient image regions through a top-down mechanism. More recently, the self-attention networks introduced by Transformers [27] have been widely adopted in both language and vision tasks [9,22,26,39,40]. Guo et al. [11] normalize the self-attention module in the Transformer to address internal covariate shift.…”
Section: Related Work
confidence: 99%
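To make the technique named in this excerpt concrete, here is a minimal sketch of the scaled dot-product self-attention introduced by the Transformer [27]. All function names, tensor shapes, and dimensions are illustrative assumptions, not code from the paper or any cited work.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                      # query/key/value projections
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)   # scaled pairwise similarities
    weights = F.softmax(scores, dim=-1)                      # each position attends over all others
    return weights @ v                                       # attention-weighted sum of values

d_model = d_k = 64
x = torch.randn(2, 10, d_model)                # e.g. 10 region/frame/token features (assumed sizes)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)         # -> (2, 10, 64)
```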
“…For example, LSTMs [12] are used to embed temporal visual and textual features in [15], and attention mechanisms [24], which allow the model to focus only on specific parts of the input, bring significant improvements in [6,30,34,35]. Other work [32,33] applies Transformers [5] to capture information from videos. For better fusion of independent visual and textual sources, Hirota et al. [10,11] propose using textual representations to understand the visual sources.…”
Section: Related Work
confidence: 99%
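The LSTM-plus-attention pattern this excerpt describes can be sketched as follows: an LSTM embeds per-frame visual features, and a simple additive attention pooling focuses on the most informative frames. Dimensions and layer names are assumptions for illustration, not taken from [6,15,30,34,35].

```python
import torch
import torch.nn as nn

frame_feats = torch.randn(2, 16, 2048)        # (batch, frames, CNN feature dim), assumed sizes
lstm = nn.LSTM(input_size=2048, hidden_size=512, batch_first=True)
outputs, _ = lstm(frame_feats)                # per-frame temporal embeddings: (2, 16, 512)

scorer = nn.Linear(512, 1)                    # one relevance score per frame
weights = torch.softmax(scorer(outputs).squeeze(-1), dim=-1)  # (2, 16), sums to 1 over frames
clip_emb = (weights.unsqueeze(-1) * outputs).sum(dim=1)       # attended clip summary: (2, 512)
```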
“…It integrates information from multiple modalities and is therefore expected to yield better predictions than any unimodal approach [1]. It has since been applied to a broad range of applications, such as multimedia event detection [2,3], sentiment analysis [1,4], cross-modal translation [5][6][7], and Visual Question Answering (VQA) [8,9].…”
Section: Introduction
confidence: 99%
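As a minimal sketch of the multimodal prediction setting this excerpt describes (e.g. VQA): unimodal encoders produce embeddings that are concatenated and classified jointly. Every dimension and the answer-vocabulary size here are hypothetical, chosen only to make the late-fusion baseline concrete.

```python
import torch
import torch.nn as nn

video_emb = torch.randn(2, 512)               # output of some video encoder (assumed)
text_emb = torch.randn(2, 512)                # output of some question encoder (assumed)

fused = torch.cat([video_emb, text_emb], dim=-1)   # simplest fusion: concatenation
classifier = nn.Sequential(
    nn.Linear(1024, 256),
    nn.ReLU(),
    nn.Linear(256, 1000),                     # hypothetical 1000-way answer vocabulary
)
logits = classifier(fused)                    # prediction conditioned on both modalities
```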