Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition

Hashiguchi, Ryota; Tamaki, Toru

doi:10.48550/arxiv.2204.00452

Cited by 2 publications

(4 citation statements)

References 17 publications

“…Spatio-temporal attention. For more dedicated structural modeling in the time dimension with ViTs, a mainstream approach in the video domain is to develop various spatio-temporal attention mechanisms by further imposing temporal attention on top [6,2,3,9,85,24]. We choose two representative video ViT models, TimeSformer [6] and XViT [9], in our performance benchmark.…”

Section: Methods (mentioning)

confidence: 99%

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition

Pan, Lin, Zhu et al. 2022

Preprint

Capitalizing on large pre-trained models for various downstream tasks of interest has recently emerged with promising performance. Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality (e.g., image understanding) as the pre-trained model. This is limiting because in some modalities (e.g., video understanding) such a strong pre-trained model with sufficient knowledge is less available or not available at all. In this work, we investigate such a novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (∼8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters compared to previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, whilst enjoying the advantage of parameter efficiency.

“…Therefore, it is straightforward to introduce the features of temporal and spatial neighbour tokens as additional guidance. To enhance the communication within the neighbourhood, we take advantage of the feature shift technique [90,31,7] that is parameter-free and efficient in use. Considering that each query is attended by key-value pairs from other tokens [31,7] for information exchange, each K and V is reconstructed by sequentially mixing its temporally and spatially adjacent tokens including itself (namely, temporal shift and spatial shift).…”

Section: Neighborhood Association (mentioning)

confidence: 99%

“…To enhance the communication within the neighbourhood, we take advantage of the feature shift technique [90,31,7] that is parameter-free and efficient in use. Considering that each query is attended by key-value pairs from other tokens [31,7] for information exchange, each K and V is reconstructed by sequentially mixing its temporally and spatially adjacent tokens including itself (namely, temporal shift and spatial shift). The neighbourhood association is proven to be effective in improving the classification accuracy.…”

Section: Neighborhood Association (mentioning)

confidence: 99%

“…We hence propose to use neighbourhood association as extra guidance for feature fixation. Specifically, we reconstruct each key and value vector (these are responsible for information exchange between tokens [31,7]) by sequentially mixing key/value features of nearby tokens in the spatial and temporal domain. This is efficiently realized by employing the feature shift technique [90,97].…”

Section: Introduction (mentioning)

confidence: 99%
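
The citation statements above describe mixing each key and value with its temporally adjacent tokens through a parameter-free feature shift. The following is a minimal sketch of a temporal shift for one token position followed across frames, not the implementation of either the cited or the citing paper. The function name temporal_shift, the fold_div parameter, and the zero padding at the sequence boundaries are assumptions made for illustration.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels along the time axis (zero-padded).

    frames:   list of T feature vectors, each a list of C floats; these are
              the features of one token position in T consecutive frames.
    fold_div: hypothetical parameter; C // fold_div channels are taken from
              the previous frame and C // fold_div from the next frame.
    """
    num_frames = len(frames)
    channels = len(frames[0])
    fold = channels // fold_div
    shifted = [[0.0] * channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(channels):
            if c < fold:
                src = t - 1   # channel comes from the previous frame
            elif c < 2 * fold:
                src = t + 1   # channel comes from the next frame
            else:
                src = t       # remaining channels stay in place
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
            # otherwise the zero padding is kept at the sequence boundary
    return shifted


# Three frames, four channels; each value encodes 10 * frame + channel.
keys = [[10.0 * t + c for c in range(4)] for t in range(3)]
shifted_keys = temporal_shift(keys, fold_div=4)
```

With this toy input, channel 0 of every frame carries the previous frame's value, channel 1 carries the next frame's value, and channels 2 and 3 are untouched. Applying the same function to the value vectors, and then attending with unshifted queries, lets each query see information from neighbouring frames without adding any parameters.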

Linear Video Transformer with Feature Fixation

Lu, Liu, Wang et al. 2022

Preprint

Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism. Some studies alleviate the computational costs by reducing the number of tokens in attention calculation, but the complexity is still quadratic. Another promising way is to replace Softmax attention with linear attention, which owns linear complexity but presents a clear performance drop. We find that such a drop in linear attention results from the lack of attention concentration on critical features. Therefore, we propose a feature fixation module to reweight feature importance of the query and key before computing linear attention. Specifically, we regard the query, key, and value as various latent representations of the input token, and learn the feature fixation ratio by aggregating Query-Key-Value information. This is beneficial for measuring the feature importance comprehensively. Furthermore, we enhance the feature fixation by neighborhood association, which leverages additional guidance from spatial and temporal neighbouring tokens. The proposed method significantly improves the linear attention baseline and achieves state-of-the-art performance among linear video Transformers on three popular video classification benchmarks. With fewer parameters and higher efficiency, our performance is even comparable to some Softmax-based quadratic Transformers.
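
The abstract above contrasts quadratic Softmax attention with linear attention. The sketch below illustrates only the generic linear attention idea: keys and values are summarised once, so the cost grows linearly with the number of tokens. It does not implement the paper's feature fixation module or neighborhood association. The feature map elu(x) + 1 and all function names are assumptions chosen for illustration.

```python
import math


def feature_map(x):
    # elu(x) + 1, a positive feature map (an assumption, not from the source).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernelised attention computed in O(N * d * d_v) for N tokens."""
    d, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]   # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d                      # sum_j phi(k_j)
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for a in range(d):
            k_sum[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        norm = sum(phi_q[a] * k_sum[a] for a in range(d))
        outputs.append(
            [sum(phi_q[a] * kv[a][b] for a in range(d)) / norm
             for b in range(d_v)]
        )
    return outputs


def quadratic_reference(queries, keys, values):
    """Same weights computed pairwise, O(N^2), used only as a check."""
    d_v = len(values[0])
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        weights = [sum(a * b for a, b in zip(phi_q, feature_map(k)))
                   for k in keys]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total
             for b in range(d_v)]
        )
    return outputs


queries = [[0.5, -1.0], [1.0, 2.0]]
keys = [[0.1, 0.2], [-0.3, 0.4], [1.5, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast = linear_attention(queries, keys, values)
slow = quadratic_reference(queries, keys, values)
```

Both functions return the same outputs up to floating-point error; the linear version simply reorders the products so that no N-by-N weight matrix is ever formed.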
