2022
DOI: 10.48550/arxiv.2205.12956
Preprint

Inception Transformer

Abstract: Recent studies show that Transformer has strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high-and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling fo…

Cited by 12 publications (13 citation statements)
References 46 publications
“…Recently, Inception Transformer [45], which has three branches (average pooling, convolution, and self-attention) fused with a depthwise convolution, achieves impressive performance on several vision tasks. Our E-Branchformer shares a similar spirit of combining local and global information both sequentially and in parallel.…”
Section: Hybrid - Both Sequentially and In Parallel
confidence: 99%
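The three-branch mixer described above can be sketched as follows. This is a minimal illustrative version, not the paper's implementation: channels of a token sequence are split across an average-pooling branch, a depthwise 1-D convolution branch, and a self-attention branch, then concatenated (the paper additionally fuses the branches with a depthwise convolution, omitted here for brevity; all shapes and weights are hypothetical).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inception_mixer(x, rng):
    """Illustrative three-branch mixer on a (N, C) token sequence.

    Channels are split evenly across an average-pooling branch, a
    depthwise convolution branch, and a self-attention branch.
    """
    N, C = x.shape
    c = C // 3
    xa, xb, xc = x[:, :c], x[:, c:2 * c], x[:, 2 * c:]

    # Branch 1: local average pooling over a 3-token neighbourhood.
    pad = np.pad(xa, ((1, 1), (0, 0)), mode='edge')
    avg = (pad[:-2] + pad[1:-1] + pad[2:]) / 3.0

    # Branch 2: depthwise 1-D convolution with (random) 3-tap filters.
    k = rng.standard_normal((3, c))
    padb = np.pad(xb, ((1, 1), (0, 0)), mode='edge')
    conv = padb[:-2] * k[0] + padb[1:-1] * k[1] + padb[2:] * k[2]

    # Branch 3: single-head self-attention with Q = K = V = xc.
    attn = softmax(xc @ xc.T / np.sqrt(xc.shape[1])) @ xc

    # Merge: plain concatenation along the channel axis.
    return np.concatenate([avg, conv, attn], axis=1)
```

The pooling and convolution branches act locally (high-frequency), while the attention branch mixes all tokens globally (low-frequency), which is the division of labour the quoted statement refers to.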
“…Presumably, using nearby information can improve the merge process. Similar to Inception-Transformer [45], we employ a depth-wise convolution to add spatial information exchange (as described in Figure 3c). Formally, the outputs from the global Y_G and the local Y_L branch are merged:…”
Section: Depth-wise Convolution
confidence: 99%
“…Xie et al. [20] proposed a framework that outputs multi-scale features via a hierarchical Transformer encoder and saves on attention computation. Recently, Inception Transformer [12] adds an Inception module inside a Transformer to extract high-frequency representations, so that stronger Transformer performance can be obtained. We, however, aim at combining Transformers with different receptive fields in the Inception style for robust feature abstraction.…”
Section: Transformer
confidence: 99%
“…Interestingly, Transformer [5] can capture long-range relations. Besides, the parallel structures in convolutional neural network (CNN)-based studies, Inception [10] and its variants [11][12][13][14], have been demonstrated to be very effective at capturing rich scales.…”
Section: Introduction
confidence: 99%
“…For the convolution branch, different from [53], [54], [55], which perform convolution with the input features, we instead extract the convolution features from the value V , which is not partitioned into windows. In this way, the convolution layer can explore the correlations among neighboring windows, which further enhances the correlations of tokens along the window borders.…”
Section: Enhanced Transformer Based Feature Extraction
confidence: 99%
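The contrast drawn above can be sketched as follows. This is an illustrative reading, not the cited paper's code: attention operates on non-overlapping windows of the value map V, while a convolution applied to the *unpartitioned* V lets border tokens still see neighbours in adjacent windows. All shapes and the shared kernel are assumptions for brevity.

```python
import numpy as np

def window_partition(x, ws):
    """Split a (H, W, C) map into non-overlapping (ws, ws, C) windows."""
    H, W, C = x.shape
    return (x.reshape(H // ws, ws, W // ws, ws, C)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, ws, ws, C))

def conv_on_full_value(v, kernel):
    """3x3 convolution on the unpartitioned value map (one kernel shared
    across channels), so window-border tokens mix with their neighbours."""
    H, W, C = v.shape
    vp = np.pad(v, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(v)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.einsum('ijc,ij->c', vp[i:i + 3, j:j + 3], kernel)
    return out
```

Inside `window_partition`, a token at a window edge cannot attend across the boundary; `conv_on_full_value` is what restores those cross-window correlations.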