Inception Transformer

Si, Chenyang; Yu, Wang; Zhou, Pan; Zhou, Yichen; Wang, Xinchao; Yan, Shuicheng

doi:10.48550/arxiv.2205.12956

Cited by 12 publications

(13 citation statements)

References 46 publications

“…Recently, Inception Transformer [45] which has three branches (average pooling, convolution, and self-attention) fused with a depthwise convolution achieves impressive performance on several vision tasks. Our E-Branchformer shares a similar spirit of combining local and global information both sequentially and in parallel.…”

Section: Hybrid - Both Sequentially and In Parallel (mentioning)

confidence: 99%

E-Branchformer: Branchformer with Enhanced merging for speech recognition

Kim, Wu, Peng et al. 2022

Preprint

Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) 1.81% and 3.65% on LibriSpeech test-clean and test-other sets without using any external training data.

“…Presumably, using nearby information can improve the merge process. Similar to Inception-Transformer [45], we employ a depth-wise convolution to add the spatial information exchanging (as described in Figure 3c). Formally, the outputs from the global Y_G and the local Y_L branch are merged:…”

Section: Depth-wise Convolution (mentioning)

confidence: 99%

E-Branchformer: Branchformer with Enhanced merging for speech recognition

Kim, Wu, Peng et al. 2022

Preprint

“…Xie et al [20] proposed a framework that outputs multi-scale features by a hierarchical Transformer encoder and saves on attentional computation. Recently, Inception Transformer [12] adds an additional Inception module inside a Transformer for extracting high-frequency representation, so that more strong performances of Transformer can be obtained. We, however, aim at combining different Transformers with different reception fields in the Inception style for robust feature abstraction.…”

Section: Transformer (mentioning)

confidence: 99%

“…Interestingly, Transformer [5] can capture the long-range relations. Besides, the parallel structures in the convolutional neural networks (CNN)-based studies, Inception [10] and its variants [11][12][13][14] have been demonstrated to be very effective with rich scales.…”

Section: Introduction (mentioning)

confidence: 99%

Parallel matters: Efficient polyp segmentation with parallel structured feature augmentation modules

Guo, Fang, Wang et al. 2023

IET Image Processing

The large variations of polyp sizes and shapes and the close resemblances of polyps to their surroundings call for features with long-range information in rich scales and strong discrimination. This article proposes two parallel structured modules for building those features. One is the Transformer Inception module (TI) which applies Transformers with different reception fields in parallel to input features and thus enriches them with more long-range information in more scales. The other is the Local-Detail Augmentation module (LDA) which applies the spatial and channel attentions in parallel to each block and thus locally augments the features from two complementary dimensions for more object details. Integrating TI and LDA, a new Transformer encoder based framework, Parallel-Enhanced Network (PENet), is proposed, where LDA is specifically adopted twice in a coarse-to-fine way for accurate prediction. PENet is efficient in segmenting polyps with different sizes and shapes without the interference from the background tissues. Experimental comparisons with state-of-the-art methods show its merits.

“…For the convolution branch, different from [53], [54], [55], which perform convolution with the input features, we instead extract the convolution features from the value V, which is not partitioned into windows. In this way, the convolution layer can explore the correlations among neighboring windows, which further enhances the correlations of tokens along the window borders.…”

Section: Enhanced Transformer Based Feature Extraction (mentioning)

confidence: 99%

ITSRN++: Stronger and Better Implicit Transformer Network for Continuous Screen Content Image Super-Resolution

Shen, Yue, Yang et al. 2022

Preprint

Nowadays, online screen sharing and remote cooperation are becoming ubiquitous. However, the screen content may be downsampled and compressed during transmission, while it may be displayed on large screens or the users would zoom in for detail observation at the receiver side. Therefore, a strong and effective screen content image (SCI) super-resolution (SR) method is needed. We observe that the weight-sharing upsampler (such as deconvolution or pixel shuffle) could be harmful to sharp and thin edges in SCIs, and the fixed scale upsampler makes it inflexible to fit screens with various sizes. To solve this problem, we propose an implicit transformer network for continuous SCI SR (termed ITSRN++). Specifically, we propose a modulation based transformer as the upsampler, which modulates the pixel features in discrete space via a periodic nonlinear function to generate features for continuous pixels. To enhance the extracted features, we further propose an enhanced transformer as the feature extraction backbone, where convolution and attention branches are used in parallel. Besides, we construct a large scale SCI2K dataset to facilitate the research on SCI SR. Experimental results on nine datasets demonstrate that the proposed method achieves state-of-the-art performance for SCI SR (outperforming SwinIR by 0.74 dB for ×3 SR) and also works well for natural image SR. Our codes and dataset will be released upon the acceptance of this work.
