2019
DOI: 10.48550/arxiv.1901.10430
Preprint

Pay Less Attention with Lightweight and Dynamic Convolutions

Abstract: Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determi…
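The key mechanism in the abstract is that the convolution kernel applied at each position is predicted from that position's representation alone, softmax-normalized over the kernel width, and shared across groups of channels ("heads"). The sketch below is a minimal PyTorch reading of that description, not the authors' fairseq implementation; names such as DynamicConv1d, num_heads, and kernel_size are illustrative assumptions.

```python
# Minimal sketch of a dynamic convolution layer as described in the abstract:
# the kernel depends only on the current time step, is softmax-normalized over
# its width, and its weights are shared across the channels within each head.
# This is an illustrative reconstruction, not the authors' fairseq code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConv1d(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.dim = dim
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # Predicts one kernel of width `kernel_size` per head, per time step.
        self.kernel_proj = nn.Linear(dim, num_heads * kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        B, T, D = x.shape
        H, K = self.num_heads, self.kernel_size

        # Kernel is a function of the current time step only; normalize over width.
        kernels = self.kernel_proj(x).view(B, T, H, K)
        kernels = F.softmax(kernels, dim=-1)

        # Gather a K-wide window around each position (symmetric padding here;
        # causal left-padding would be used for autoregressive decoding).
        pad = K // 2
        x_padded = F.pad(x, (0, 0, pad, K - 1 - pad))           # (B, T+K-1, D)
        windows = x_padded.unfold(dimension=1, size=K, step=1)  # (B, T, D, K)
        windows = windows.reshape(B, T, H, D // H, K)

        # Apply the per-position, per-head kernel; the same K weights are shared
        # across the D/H channels inside each head (the "lightweight" sharing).
        out = torch.einsum('bthck,bthk->bthc', windows, kernels)
        return out.reshape(B, T, D)


# Usage: a drop-in replacement for self-attention over a (batch, time, dim) tensor.
x = torch.randn(2, 10, 16)
conv = DynamicConv1d(dim=16, kernel_size=3, num_heads=4)
y = conv(x)  # (2, 10, 16)
```

Because the kernel is predicted from the current time step rather than from pairwise comparisons with every context element, the per-position cost grows with the kernel width instead of the sequence length.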

Cited by 80 publications (99 citation statements)
References 29 publications
“…Unlike the recent hybrid architectures (e.g., Hybrid-ViT [14] and BoTNet [45]) that rely on convolutions for feature encoding, Outlooker proposes to use local pair-wise token similarities to encode fine-level features and spatial context into tokens features and hence is more effective and parameter-efficient. This also makes our model different from the Dynamic Convolution [60] and Involution [34] that generate input-dependent convolution kernels to encode the features.…”
Section: Related Work (mentioning)
confidence: 99%
“…Introducing Convolution to Transformers. Convolutions have been used to change the Transformer block in NLP and 2D image recognition, either by replacing multi-head attentions with convolution [48] or adding more convolution layers to capture local correlations [52,26,49]. Different from all the previous works, we propose convolution operation (i.e., EdgeConv [46]) solely on query features to summarize local responses from unordered 3D points to generate global geometric representations, of which the purpose is totally opposite to [26,49].…”
Section: Related Work (mentioning)
confidence: 99%
“…For the text-to-text sequential modeling (Sutskever et al, 2014), NMT model generally comprises encoder and decoder structure that takes input sequence and generates output sequence auto-regressively. It has been developed to Recurrent Neural Network (RNN), Convolution Neural Network (CNN) (Gehring et al, 2017; Wu et al, 2019), and Transformer-based model (Vaswani et al, 2017) which outperforms other existing methods. Furthermore, fine-tuning approaches for pre-trained language models have recently shown the best performance including Cross-lingual Language Model Pre-training (XLM) (Lample and Conneau, 2019), Masked Sequence to Sequence Pre-training for Language Generation (MASS) (Song et al, 2019), and Multilingual BART (mBART).…”
Section: Machine Translation (mentioning)
confidence: 99%