UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

Li, Kunchang; Wang, Yali; Gao, Peng; Song, Guanglu; Liu, Yu; Li, Hongsheng; Qiao, Yu

doi:10.48550/arxiv.2201.04676

Cited by 27 publications (58 citation statements)

References 20 publications

“…For image classification, we evaluate iFormer on the ImageNet dataset [28]. We train the iFormer model with the standard procedure in [6,22,29]. Specifically, we use AdamW optimizer with an initial learning rate 1 × 10⁻³ via cosine decay [69], a momentum of 0.9, and a weight decay of 0.05.…”

Section: Results On Image Classification (mentioning)

confidence: 99%
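The cosine decay mentioned in the statement above can be illustrated with a minimal sketch. It assumes the common half-cosine form and uses the quoted initial learning rate of 1 × 10⁻³. The function name `cosine_lr` and the parameters `total_steps` and `min_lr` are hypothetical, and no warmup is modeled because the quoted text does not describe one. This is not the citing authors' training code.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Learning rate at `step` under a half-cosine decay.

    base_lr comes from the quoted recipe; total_steps and min_lr
    are illustrative assumptions, not values from the source.
    """
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 25, 50, 75, 100):
        print(s, cosine_lr(s, total_steps=100))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the final step.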

“…Compared with various Transformer backbones, our iFormers still maintain the performance superiority over their results. For example, our iFormer-B surpasses UniFormer-B [22], Swin-S [5] by 0.9 points of AP^b and 3.5 points of AP^b respectively.…”

Section: Results On Object Detection and Instance Segmentation (mentioning)

confidence: 99%

“…LV-ViT [50], LeViT [54] and ViT C [21] further stack several layers of convolutions as the stem for models, which is found helpful in training and achieving better performance. Besides the stem, ViT-hybrid [18], CoAtNet [24], Hybrid-MS [55] and UniFormer [22] design early stages with convolution layers. However, the combination of convolution and attention in a serial order means each layer can only process either high or low frequency and neglects the other part.…”

Section: Related Work (mentioning)

confidence: 99%

“…Unlike ViTs, they cover more local information through local convolution within the receptive fields, thus effectively extracting high-frequency representations [19,20]. Recent studies [21][22][23][24][25] have integrated CNNs and ViTs considering their complementary advantages. Some methods [21,22,24,25] stack convolution and attention layers in a serial manner to inject the local information into global context.…”

Section: Introduction (mentioning)

confidence: 99%

“…Recent studies [21][22][23][24][25] have integrated CNNs and ViTs considering their complementary advantages. Some methods [21,22,24,25] stack convolution and attention layers in a serial manner to inject the local information into global context. Unfortunately, this serial manner only models one type of dependency, either global or local, in one layer, and discards the global information during locality modeling, or vice versa.…”

Section: Introduction (mentioning)

confidence: 99%

Inception Transformer

Si, Yu, Zhou et al. 2022

Preprint

Recent studies show that Transformer has strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing the high-frequency information to Transformers. Different from recent hybrid frameworks, the Inception mixer brings greater efficiency through a channel splitting mechanism to adopt parallel convolution/max-pooling path and self-attention path as high- and low-frequency mixers, while having the flexibility to model discriminative information scattered within a wide frequency range. Considering that bottom layers play more roles in capturing high-frequency details while top layers more in modeling low-frequency global information, we further introduce a frequency ramp structure, i.e., gradually decreasing the dimensions fed to the high-frequency mixer and increasing those to the low-frequency mixer, which can effectively trade off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation. For example, our iFormer-S hits the top-1 accuracy of 83.4% on ImageNet-1K, much higher than DeiT-S by 3.6%, and even slightly better than much bigger model Swin-B (83.3%) with only 1/4 parameters and 1/3 FLOPs. Code and models will be released at https://github.com/sail-sg/iFormer. * Equal contribution. Weihao Yu did this work during an internship at Sea AI Lab.
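The frequency ramp described in the abstract (fewer channels to the high-frequency mixer and more to the low-frequency mixer as depth increases) can be sketched as a per-layer channel split. The linear schedule, the start and end ratios, and the function name `frequency_ramp` are all assumptions made for illustration. The abstract does not give the actual ratios used in iFormer.

```python
def frequency_ramp(dim, num_layers, start_high=0.75, end_high=0.25):
    """Return (high_channels, low_channels) for each layer.

    Hypothetical linear ramp: the share of channels routed to the
    high-frequency path (convolution / max-pooling) shrinks from
    start_high to end_high, and the rest go to the low-frequency
    path (self-attention). The ratios are illustrative only.
    """
    splits = []
    for i in range(num_layers):
        # Position of this layer in [0, 1]; a single layer sits at 0.
        t = i / (num_layers - 1) if num_layers > 1 else 0.0
        ratio = start_high + (end_high - start_high) * t
        high = int(round(dim * ratio))
        low = dim - high  # the split always sums to dim
        splits.append((high, low))
    return splits


if __name__ == "__main__":
    print(frequency_ramp(64, 3))  # [(48, 16), (32, 32), (16, 48)]
```

Each tuple sums to `dim`, so the two paths can be concatenated back to the original width after mixing.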

Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition

Kim, Lee, Cho et al. 2020

Lecture Notes in Computer Science

Data augmentation has recently emerged as an essential component of modern training recipes for visual recognition tasks. However, data augmentation for video recognition has rarely been explored despite its effectiveness. The few existing augmentation recipes for video recognition naively extend image augmentation methods by applying the same operations to all frames of the video. Our main idea is that the magnitude of augmentation operations for each frame needs to change over time to capture the temporal variations of real-world video. These variations should be as diverse as possible while using few additional hyperparameters during training. With this motivation, we propose a simple yet effective video data augmentation framework, DynaAugment. The magnitude of augmentation operations on each frame is changed by an effective mechanism, Fourier Sampling, that parameterizes diverse, smooth, and realistic temporal variations. DynaAugment also includes an extended search space suitable for video for automatic data augmentation methods. Experiments with DynaAugment demonstrate that there is room for further performance improvement over static augmentations on diverse video models. Specifically, we show the effectiveness of DynaAugment on various video datasets and tasks: large-scale video recognition (Kinetics-400 and Something-Something-v2), small-scale video recognition (UCF-101 and HMDB-51), fine-grained video recognition (Diving-48 and FineGym), video action segmentation on Breakfast, video action localization on THUMOS'14, and video object detection on MOT17Det. DynaAugment also enables video models to learn more generalized representations, improving model robustness on corrupted videos.
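The idea of a per-frame augmentation magnitude that varies smoothly over time can be sketched as a weighted sum of a few random sinusoids. This is only one way such a schedule might be produced. The function name `fourier_magnitudes`, the number of basis waves, the frequency range, and the `base` and `amplitude` values are assumptions, and the abstract does not specify how Fourier Sampling is actually parameterized.

```python
import math
import random


def fourier_magnitudes(num_frames, base=0.5, amplitude=0.3,
                       num_basis=3, max_freq=2.0, seed=None):
    """Smooth per-frame magnitudes in [base - amplitude, base + amplitude].

    Hypothetical sketch: mixes `num_basis` sinusoids with random
    frequency, phase, and non-negative weight. All parameters are
    illustrative, not taken from the source.
    """
    rng = random.Random(seed)
    waves = [
        (rng.uniform(0.2, max_freq), rng.uniform(0.0, 2.0 * math.pi), rng.random())
        for _ in range(num_basis)
    ]
    # Normalize by total weight so the mixture stays within [-1, 1].
    total_w = sum(w for _, _, w in waves) or 1.0
    mags = []
    for t in range(num_frames):
        x = t / max(num_frames - 1, 1)  # normalized time in [0, 1]
        s = sum(w * math.sin(2.0 * math.pi * f * x + p) for f, p, w in waves)
        mags.append(base + amplitude * s / total_w)
    return mags


if __name__ == "__main__":
    for m in fourier_magnitudes(8, seed=0):
        print(round(m, 3))
```

A static augmentation corresponds to `amplitude=0`, where every frame receives the same magnitude `base`.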
