2022
DOI: 10.48550/arxiv.2204.00452
Preprint

Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition

Abstract: We propose Multi-head Self/Cross-Attention (MSCA), which introduces a temporal cross-attention mechanism for action recognition, based on the structure of the Multi-head Self-Attention (MSA) mechanism of the Vision Transformer (ViT). Simply applying ViT to each frame of a video can capture per-frame features, but cannot model temporal features. However, directly modeling temporal information with a CNN or Transformer is computationally expensive. TSM, which performs feature shifting, assumes a CNN and cannot take advantage…
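To make the idea in the abstract concrete, here is a minimal PyTorch sketch of combining a TSM-style temporal shift with attention: a fraction of the key/value channels is shifted to adjacent frames before standard scaled dot-product attention, so per-frame queries attend to temporally mixed keys and values without any extra parameters. This is not the authors' MSCA implementation; the function names (temporal_shift, shifted_cross_attention), the shift_div fraction, and the (B*T, N, C) token layout are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def temporal_shift(x, n_frames, shift_div=4):
    """Shift a fraction of channels along the time axis, TSM-style.

    x: (B*T, N, C) patch tokens per frame; n_frames = T.
    1/shift_div of the channels is taken from the previous frame,
    1/shift_div from the next frame, and the rest stays unchanged.
    Border frames are zero-padded, as in TSM.
    """
    bt, n, c = x.shape
    b = bt // n_frames
    x = x.view(b, n_frames, n, c)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # from previous frame
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # from next frame
    out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]              # untouched channels
    return out.view(bt, n, c)

def shifted_cross_attention(q, k, v, n_frames, n_heads=8):
    """Attention in which keys and values are temporally shifted.

    q, k, v: (B*T, N, C). Queries stay per-frame; keys and values mix in
    features from neighbouring frames via the shift, so the attention acts
    as a cheap form of cross-frame (temporal) modeling.
    """
    k = temporal_shift(k, n_frames)
    v = temporal_shift(v, n_frames)
    bt, n, c = q.shape
    d = c // n_heads
    # reshape to (B*T, heads, N, d) and run standard scaled dot-product attention
    q = q.view(bt, n, n_heads, d).transpose(1, 2)
    k = k.view(bt, n, n_heads, d).transpose(1, 2)
    v = v.view(bt, n, n_heads, d).transpose(1, 2)
    attn = F.softmax((q @ k.transpose(-2, -1)) / d ** 0.5, dim=-1)
    out = attn @ v
    return out.transpose(1, 2).reshape(bt, n, c)
```

The shift itself adds no learnable parameters and almost no computation beyond memory movement, which is the efficiency argument behind shift-based temporal modeling.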

Cited by 2 publications (4 citation statements)
References 17 publications

“…Spatio-temporal attention. For more dedicated structural modeling in the time dimension with ViTs, a mainstream approach in the video domain is to develop various spatio-temporal attention mechanisms by further imposing temporal attention on top [6,2,3,9,85,24]. We choose two representative video ViT models, TimeSformer [6] and XViT [9], in our performance benchmark.…”
Section: Methods
confidence: 99%
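For context on the "temporal attention on top" family mentioned in the statement above, here is a minimal sketch of a divided space-time attention block, loosely in the spirit of TimeSformer: temporal attention over the same patch position across frames, followed by spatial attention within each frame. It is not the actual code of the benchmarked models; the class name DividedSpaceTimeAttention, the residual wiring, and the head count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Toy sketch of temporal attention imposed on top of spatial attention.

    Tokens are arranged as (B, T, N, C): T frames, N patches per frame.
    Temporal attention lets each patch attend to the same patch position
    across frames; spatial attention then operates within each frame.
    """

    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):
        b, t, n, c = x.shape
        # temporal attention: sequences of length T, one per (batch, patch) pair
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]
        x = xt.reshape(b, n, t, c).permute(0, 2, 1, 3)
        # spatial attention: sequences of length N, one per (batch, frame) pair
        xs = x.reshape(b * t, n, c)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]
        return xs.reshape(b, t, n, c)
```

Factorizing attention this way replaces one pass over all T·N tokens with separate T-length and N-length sequences, which is why it is a common efficiency compromise for video ViTs.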
“…Therefore, it is straightforward to introduce the features of temporal and spatial neighbour tokens as additional guidance. To enhance the communication within the neighbourhood, we take advantage of the feature shift technique [90,31,7] that is parameter-free and efficient in use. Considering that each query is attended by key-value pairs from other tokens [31,7] for information exchange, each K and V is reconstructed by sequentially mixing its temporally and spatially adjacent tokens including itself (namely, temporal shift and spatial shift). The neighbourhood association is proven to be effective in improving the classification accuracy.…”
Section: Neighborhood Association
confidence: 99%
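The passage above describes rebuilding K and V by mixing in temporally and spatially adjacent tokens via parameter-free shifts. A rough sketch of the spatial half is shown below, assuming an h × w patch grid and a (B, N, C) token layout; the names spatial_shift and reconstruct_kv are illustrative, and the temporal half would mirror the frame-wise channel shift sketched after the abstract.

```python
import torch

def spatial_shift(x, h, w, shift_div=8):
    """Parameter-free spatial shift over the patch grid of one frame.

    x: (B, N, C) with N = h * w patch tokens. A small fraction of the
    channels is borrowed from the left/right/upper/lower neighbour patch,
    so each token's features mix with its spatial neighbourhood; the
    remaining channels and border patches without a neighbour keep
    their own features.
    """
    b, n, c = x.shape
    x = x.view(b, h, w, c)
    fold = c // shift_div
    out = x.clone()
    out[:, :, 1:, 0 * fold:1 * fold] = x[:, :, :-1, 0 * fold:1 * fold]  # from left neighbour
    out[:, :, :-1, 1 * fold:2 * fold] = x[:, :, 1:, 1 * fold:2 * fold]  # from right neighbour
    out[:, 1:, :, 2 * fold:3 * fold] = x[:, :-1, :, 2 * fold:3 * fold]  # from upper neighbour
    out[:, :-1, :, 3 * fold:4 * fold] = x[:, 1:, :, 3 * fold:4 * fold]  # from lower neighbour
    return out.view(b, n, c)

def reconstruct_kv(k, v, h, w):
    """Rebuild K and V by mixing in spatially adjacent tokens before attention.

    In the cited description, K and V are reconstructed by sequentially
    applying a temporal shift (across frames) and a spatial shift (across
    the patch grid); only the spatial step is sketched here.
    """
    return spatial_shift(k, h, w), spatial_shift(v, h, w)
```

Because the shifted channels simply reuse neighbouring tokens' features, the reconstruction adds no parameters, matching the "parameter-free and efficient" property the citing authors highlight.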