Vicinity Vision Transformer

Sun, Weixuan; Qin, Z. H.; Deng, Hui; Wang, Jianyuan; Zhang, Yi; Barnes, Nick; Birchfield, Stan; Kong, Lingpeng; Zhong, Yiran

doi:10.48550/arxiv.2206.10552

Cited by 2 publications

(2 citation statements)

References 31 publications


“…To reduce the complexity close to linear O(N), linear Transformers [37,63,15,14,64,87,72,71] decompose the similarity function δ(·) to a kernel function ρ(·), where δ(QK…”

Section: Linear Attention (mentioning)

confidence: 99%

Linear Video Transformer with Feature Fixation

Lu, Liu, Wang, et al. 2022

Preprint


Vision Transformers have achieved impressive performance in video classification, but they suffer from the quadratic complexity caused by the Softmax attention mechanism. Some studies alleviate the computational cost by reducing the number of tokens in the attention calculation, but the complexity is still quadratic. Another promising way is to replace Softmax attention with linear attention, which has linear complexity but shows a clear performance drop. We find that this drop results from the lack of attention concentration on critical features. Therefore, we propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention. Specifically, we regard the query, key, and value as different latent representations of the input token, and learn the feature fixation ratio by aggregating Query-Key-Value information. This is beneficial for measuring feature importance comprehensively. Furthermore, we enhance the feature fixation by neighborhood association, which leverages additional guidance from spatially and temporally neighbouring tokens. The proposed method significantly improves the linear attention baseline and achieves state-of-the-art performance among linear video Transformers on three popular video classification benchmarks. With fewer parameters and higher efficiency, our performance is even comparable to some Softmax-based quadratic Transformers.
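The contrast between quadratic and linear attention drawn in this abstract can be illustrated with a minimal sketch. It assumes a ReLU-based feature map as the kernel function, which is an illustrative choice and not necessarily the one used by the cited papers, and it does not implement the feature fixation module. All function names are hypothetical. The sketch only shows why reordering the matrix products removes the N x N similarity matrix.

```python
def feature_map(x):
    # Assumed kernel feature map: ReLU plus a small constant so every
    # entry is positive and the normaliser below is never zero.
    return [[max(v, 0.0) + 1e-6 for v in row] for row in x]


def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def quadratic_kernel_attention(q, k, v):
    # O(N^2): build the full N x N similarity matrix, normalise each
    # row, then take the weighted average of the values.
    fq, fk = feature_map(q), feature_map(k)
    sim = matmul(fq, transpose(fk))
    out = []
    for row in sim:
        z = sum(row)
        out.append(matmul([[s / z for s in row]], v)[0])
    return out


def linear_kernel_attention(q, k, v):
    # O(N): compute the d x d_v summary of keys and values once,
    # then reuse it for every query.
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)
    k_sum = [sum(col) for col in zip(*fk)]
    out = []
    for row in fq:
        z = sum(a * b for a, b in zip(row, k_sum))
        out.append([n / z for n in matmul([row], kv)[0]])
    return out


if __name__ == "__main__":
    q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
    k = [[1.0, 0.0], [0.2, 0.7], [-0.5, 1.2]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(quadratic_kernel_attention(q, k, v))
    print(linear_kernel_attention(q, k, v))
```

Under this assumed feature map both functions return the same values up to floating-point rounding. Only the order of the multiplications differs, which is where the saving in sequence length comes from.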


“…Efficient Transformers [17,20,25,31] have achieved remarkable advances in recent years. They reduce the quadratic computational complexity of the standard Transformer [35] by sparsifying or approximating Softmax attention in a more efficient fashion.…”

Section: Introduction (mentioning)

confidence: 99%

Neural Architecture Search on Efficient Transformers and Beyond

Liu, Liu, Lu, et al. 2022

Preprint

Self Cite


Recently, numerous efficient Transformers have been proposed to reduce the quadratic computational complexity of standard Transformers caused by the Softmax attention. However, most of them simply swap Softmax for an efficient attention mechanism without considering architectures customized for efficient attention. In this paper, we argue that the handcrafted vanilla Transformer architectures designed for Softmax attention may not be suitable for efficient Transformers. To address this issue, we propose a new framework to find optimal architectures for efficient Transformers with the neural architecture search (NAS) technique. The proposed method is validated on popular machine translation and image classification tasks. We observe that the optimal architecture of the efficient Transformer requires less computation than that of the standard Transformer, but its overall accuracy is less comparable. This indicates that Softmax attention and efficient attention each have their own distinctions, but neither of them can balance accuracy and efficiency well. This motivates us to mix the two types of attention to reduce the performance imbalance. Besides the search spaces that are commonly used in existing NAS Transformer approaches, we propose a new search space that allows the NAS algorithm to automatically search the attention variants along with the architectures. Extensive experiments on WMT'14 En-De and CIFAR-10 demonstrate that our searched architecture maintains comparable accuracy to the standard Transformer with notably improved computational efficiency.
