2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
DOI: 10.1109/iccvw54120.2021.00355

Video Transformer Network

Cited by 347 publications (112 citation statements)
References 20 publications
“…and video-text [65,64,87,26,54,1,5], and video-audio [42,53,29] representation learning. While the use of transformer architectures for video is still in its infancy, concurrent works [7,2,51,22] have already demonstrated that this is a highly promising direction. However, these approaches do not have a mechanism for reasoning about motion paths, treating time as just another dimension, unlike our approach.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, its generic nature and its lack of inductive biases also mean that transformers typically require extremely large amounts of data for training [57,8], or aggressive domain-specific augmentations [72]. This is particularly true for video data, for which transformers are also applicable [51], but where statistical inefficiencies are exacerbated. While videos carry rich temporal information, they can also contain redundant spatial information from neighboring frames.…”
Section: Introduction (mentioning)
confidence: 99%
“…Following the vision transformer (ViT) [13], which demonstrates competitive performance against CNN models on image classification, many recent works attempt to extend the vision transformer for action recognition [36,25,3,1,14]. VTN [36], VidTr [25], TimeSformer [3] and ViViT [1] share the same concept that inserts a temporal modeling module into the existing ViT to enhance the features from the temporal direction. E.g., VTN [36] processes each frame independently and then uses a longformer to aggregate the features across frames.…”
Section: Related Work (mentioning)
confidence: 99%
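The statement above describes VTN's two-stage pattern: each frame is encoded independently by a 2D backbone, and a Longformer then attends across the per-frame features. Below is a minimal PyTorch sketch of that pattern; the module names and dimensions are illustrative assumptions, and a plain TransformerEncoder stands in for the Longformer used in the actual paper.

```python
# Sketch only: per-frame encoding followed by temporal attention.
# Names/dims are hypothetical; a plain TransformerEncoder replaces VTN's Longformer.
import torch
import torch.nn as nn


class FramewiseThenTemporal(nn.Module):
    """Encode each frame independently, then aggregate features across time."""

    def __init__(self, feat_dim=768, num_classes=400):
        super().__init__()
        # Stand-in per-frame spatial encoder (the paper uses a 2D ViT backbone).
        self.frame_encoder = nn.Sequential(
            nn.Flatten(start_dim=1),           # (B*T, C*H*W)
            nn.LazyLinear(feat_dim),
        )
        # Stand-in temporal aggregator (the paper uses a Longformer here).
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, video):                   # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)            # (B*T, C, H, W)
        feats = self.frame_encoder(frames)      # each frame encoded on its own
        feats = feats.view(b, t, -1)            # (B, T, D)
        cls = self.cls_token.expand(b, -1, -1)  # prepend a clip-level [CLS] token
        seq = torch.cat([cls, feats], dim=1)
        out = self.temporal(seq)                # attention runs across frames only
        return self.head(out[:, 0])             # classify from the [CLS] token


x = torch.randn(2, 16, 3, 64, 64)               # tiny toy clip
print(FramewiseThenTemporal()(x).shape)         # torch.Size([2, 400])
```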
“…VTN [36], VidTr [25], TimeSformer [3] and ViViT [1] share the same concept that inserts a temporal modeling module into the existing ViT to enhance the features from the temporal direction. E.g., VTN [36] processes each frame independently and then uses a longformer to aggregate the features across frames. On the other hand, divided-space-time modeling in TimeSformer [4] inserts a temporal attention module into each transformer encoder for more fine-grained temporal interaction.…”
Section: Related Work (mentioning)
confidence: 99%
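For contrast with VTN's frame-then-Longformer design, the divided space-time scheme mentioned above factorizes attention inside every encoder block: a temporal step where each patch position attends across frames, followed by a spatial step where each frame attends over its own patches. The sketch below shows one such block under assumed token shapes and hypothetical names; it is not the TimeSformer reference implementation.

```python
# Sketch of one divided space-time attention block (assumed shapes and names).
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """Temporal attention over frames, then spatial attention over patches."""

    def __init__(self, dim=192, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, T, N, D) patch tokens
        b, t, n, d = x.shape
        # Temporal attention: each spatial location attends across the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        h = self.norm1(xt)
        xt = xt + self.time_attn(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
        # Spatial attention: each frame attends across its own N patches.
        xs = x.reshape(b * t, n, d)
        h = self.norm2(xs)
        xs = xs + self.space_attn(h, h, h, need_weights=False)[0]
        return xs.reshape(b, t, n, d)


tokens = torch.randn(2, 8, 49, 192)              # (batch, frames, patches, dim)
print(DividedSpaceTimeBlock()(tokens).shape)     # torch.Size([2, 8, 49, 192])
```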