2019
DOI: 10.48550/arxiv.1901.10430
Preprint

Pay Less Attention with Lightweight and Dynamic Convolutions

Abstract: Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determi…
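The key mechanism in the abstract is that the convolution kernel applied at each position is predicted from that position's representation alone, softmax-normalized over the kernel width, and shared across groups of channels ("heads"). The sketch below is a minimal PyTorch reading of that description, not the authors' fairseq implementation; names such as DynamicConv1d, num_heads, and kernel_size are illustrative assumptions.

```python
# Minimal sketch of a dynamic convolution layer as described in the abstract:
# the kernel depends only on the current time step, is softmax-normalized over
# its width, and its weights are shared across the channels within each head.
# This is an illustrative reconstruction, not the authors' fairseq code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConv1d(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.dim = dim
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # Predicts one kernel of width `kernel_size` per head, per time step.
        self.kernel_proj = nn.Linear(dim, num_heads * kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        B, T, D = x.shape
        H, K = self.num_heads, self.kernel_size

        # Kernel is a function of the current time step only; normalize over width.
        kernels = self.kernel_proj(x).view(B, T, H, K)
        kernels = F.softmax(kernels, dim=-1)

        # Gather a K-wide window around each position (symmetric padding here;
        # causal left-padding would be used for autoregressive decoding).
        pad = K // 2
        x_padded = F.pad(x, (0, 0, pad, K - 1 - pad))           # (B, T+K-1, D)
        windows = x_padded.unfold(dimension=1, size=K, step=1)  # (B, T, D, K)
        windows = windows.reshape(B, T, H, D // H, K)

        # Apply the per-position, per-head kernel; the same K weights are shared
        # across the D/H channels inside each head (the "lightweight" sharing).
        out = torch.einsum('bthck,bthk->bthc', windows, kernels)
        return out.reshape(B, T, D)


# Usage: a drop-in replacement for self-attention over a (batch, time, dim) tensor.
x = torch.randn(2, 10, 16)
conv = DynamicConv1d(dim=16, kernel_size=3, num_heads=4)
y = conv(x)  # (2, 10, 16)
```

Because the kernel is predicted from the current time step rather than from pairwise comparisons with every context element, the per-position cost grows with the kernel width instead of the sequence length.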

Cited by 80 publications (99 citation statements)
References 29 publications
“…Unlike the recent hybrid architectures (e.g., Hybrid-ViT [14] and BoTNet [45]) that rely on convolutions for feature encoding, Outlooker proposes to use local pair-wise token similarities to encode fine-level features and spatial context into tokens features and hence is more effective and parameter-efficient. This also makes our model different from the Dynamic Convolution [60] and Involution [34] that generate input-dependent convolution kernels to encode the features.…”
Section: Related Work (mentioning)
confidence: 99%
“…Introducing Convolution to Transformers. Convolutions have been used to change the Transformer block in NLP and 2D image recognition, either by replacing multi-head attentions with convolution [48] or adding more convolution layers to capture local correlations [52,26,49]. Different from all the previous works, we propose convolution operation (i.e., EdgeConv [46]) solely on query features to summarize local responses from unordered 3D points to generate global geometric representations, of which the purpose is totally opposite to [26,49].…”
Section: Related Work (mentioning)
confidence: 99%
“…For the text-to-text sequential modeling (Sutskever et al, 2014), NMT model generally comprises encoder and decoder structure that takes input sequence and generates output sequence auto-regressively. It has been developed to Recurrent Neural Network (RNN), Convolution Neural Network (CNN) (Gehring et al, 2017; Wu et al, 2019), and Transformer-based model (Vaswani et al, 2017) which outperforms other existing methods. Furthermore, fine-tuning approaches for pre-trained language models have recently shown the best performance including Cross-lingual Language Model Pre-training (XLM) (Lample and Conneau, 2019), Masked Sequence to Sequence Pre-training for Language Generation (MASS) (Song et al, 2019), and Multilingual BART (mBART).…”
Section: Machine Translation (mentioning)
confidence: 99%