2021
DOI: 10.48550/arxiv.2102.05095
Preprint

Is Space-Time Attention All You Need for Video Understanding?

Gedas Bertasius,
Heng Wang,
Lorenzo Torresani

Abstract: We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
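
The "divided attention" scheme described in the abstract applies, within each Transformer block, temporal self-attention over patches at the same spatial location across frames, followed by spatial self-attention over the patches within each frame. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration assuming a (batch, frames, patches, dim) token layout, and it omits the classification token, MLP sub-layer, and positional embeddings of the full model.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Minimal sketch of a divided space-time attention block (illustrative only)."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch B, frames T, patches per frame N, embedding dim D)
        B, T, N, D = x.shape

        # Temporal attention: each spatial location attends across the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Spatial attention: the N patches of each frame attend to one another.
        xs = x.reshape(B * T, N, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        return xs.reshape(B, T, N, D)

# Example: 2 clips, 8 frames, 196 patches of dimension 768.
block = DividedSpaceTimeBlock()
out = block(torch.randn(2, 8, 196, 768))  # -> shape (2, 8, 196, 768)
```

Factoring attention this way means each patch compares against roughly T + N others per block instead of T x N as in joint space-time attention, which is part of the paper's motivation for the divided scheme.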

Cited by 393 publications (340 citation statements)
References 34 publications
“…Recently, Transformer-based models [38,67,83,90] have achieved promising performance in various vision tasks, such as image recognition [6,14,21,39,50-52,75,90] and image restoration [11,40,89]. Some methods have tried to use Transformer for video modelling by extending the attention mechanism to the temporal dimension [2,3,38,53,60]. However, most of them are designed for visual recognition, which are fundamentally different from restoration tasks.…”
Section: Vision Transformer (mentioning)
confidence: 99%
“…Early works on HowTo100M have focused on leveraging this large collection for learning models that can be transferred to other tasks, such as action recognition [4,37,38], video captioning [24,36,66], or text-video retrieval [7,37,61]. The problem of recognizing the task performed in the instructional video has been considered by Bertasius et al [8]. However, their proposed approach does…”
Section: Related Work (mentioning)
confidence: 99%
“…The similarity between two embedding vectors is chosen to be the dot product between the two vectors. We use a total of S = 10,588 steps collected from the T = 1059 tasks used in the evaluation of Bertasius et al. [8]. This represents the subset of wikiHow tasks that have at least 100 video samples in the HowTo100M dataset.…”
Section: Implementation Details (mentioning)
confidence: 99%
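
A tiny illustration of the scoring described in that excerpt: the similarity between a video embedding and each step embedding is the plain dot product, and the highest-scoring step is retrieved. The array names, shapes, and embedding dimension below are assumptions for illustration, not values from the cited paper (apart from S = 10,588).

```python
import numpy as np

rng = np.random.default_rng(0)
S, D = 10_588, 512                               # S steps; embedding dim D is assumed
step_embeddings = rng.standard_normal((S, D))    # placeholder step embeddings
video_embedding = rng.standard_normal(D)         # placeholder query embedding

scores = step_embeddings @ video_embedding       # dot-product similarity, shape (S,)
best_step = int(np.argmax(scores))               # index of the best-matching step
print(best_step, float(scores[best_step]))
```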