2022
DOI: 10.48550/arxiv.2201.04288
Preprint

Multiview Transformers for Video Recognition

Abstract: Video understanding requires reasoning at multiple spatiotemporal resolutions, from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information…
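The abstract describes the core idea at a high level: each "view" tokenizes the same video at a different spatiotemporal resolution, each view has its own encoder, and lateral connections fuse information across views. A minimal NumPy sketch of that idea follows; the tubelet averaging, tiny encoders, and additive lateral fusion here are illustrative stand-ins, not the paper's actual embedding or attention-based fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video clip: 16 frames, each flattened to a 32-dim feature vector.
video = rng.standard_normal((16, 32))

def tubelet_tokens(frames, span):
    """Group `span` consecutive frames into one token by averaging —
    a crude stand-in for a 3D 'tubelet' embedding at one resolution."""
    n = frames.shape[0] // span
    return frames[: n * span].reshape(n, span, -1).mean(axis=1)

# Two "views" of the same clip at different temporal resolutions.
fine   = tubelet_tokens(video, span=2)   # 8 tokens: short, fine-grained motions
coarse = tubelet_tokens(video, span=8)   # 2 tokens: longer-duration events

def lateral_fuse(coarse_tokens, fine_tokens):
    """Hypothetical lateral connection: each coarse token absorbs (here,
    by simple average-pooling) the fine tokens it temporally covers."""
    ratio = fine_tokens.shape[0] // coarse_tokens.shape[0]
    pooled = fine_tokens.reshape(coarse_tokens.shape[0], ratio, -1).mean(axis=1)
    return coarse_tokens + pooled

fused = lateral_fuse(coarse, fine)
print(fused.shape)  # (2, 32): coarse-view tokens enriched with fine-view detail
```

In the actual model the per-view encoders are transformers and fusion happens via cross-view attention at intermediate layers; this sketch only shows the information flow between resolutions.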

Cited by 3 publications (6 citation statements) | References 40 publications
“…Attention-based video methods [58,60] have initially been used as part of spatio-temporal CNN architectures [6,56]. The recent introduction of Vision Transformer [10], has inspired subsequent works on action recognition by either focusing on how spatio-temporal information can be processed [1,2] or architectural optimizations for spatio-temporal data [12,39,47,66,68]. Motivated by the recent advances of spatio-temporal transformers for action recognition, we combine multiple transformer towers in our TemPr model.…”
Section: Related Work
confidence: 99%
“…Uniformer [28] is a custom fused CNN-Transformer architecture achieving good speed-accuracy trade-off. Yan et al [52] propose a multi-stream Transformer operating on different resolutions with lateral connections. Prior work [5,3,52] has shown the benefit of image pretraining for Class Score…”
Section: Related Work
confidence: 99%