Augment Your Batch: Improving Generalization Through Instance Repetition

Hoffer, Elad; Ben-Nun, Tal; Hubara, Itay; Giladi, Niv; Hoefler, Torsten; Soudry, Daniel

doi:10.1109/cvpr42600.2020.00815

Cited by 190 publications

(114 citation statements)

References 13 publications

Supporting

Mentioning

102

Contrasting


“…Linear spatial reduction attention (LSRA) [33] is utilized in the first two stages to reduce the computation cost of self-attention for long sequence. [26] 0.1 Drop path [17] 0.1 0.1 0.15 0.3 Repeated augment [15] RandAugment [5] Mixup prob. [40] 0.8 Cutmix prob.…”

Section: Methods

Mentioning

confidence: 99%

PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture

Han, Guo, Wang. 2022. Preprint.


Transformer networks have achieved great progress for computer vision tasks. Transformer-in-Transformer (TNT) architecture utilizes inner transformer and outer transformer to extract both local and global representations. In this work, we present new TNT baselines by introducing two advanced designs: 1) pyramid architecture, and 2) convolutional stem. The new "PyramidTNT" significantly improves the original TNT by establishing hierarchical representations. PyramidTNT achieves better performances than the previous state-of-the-art vision transformers such as Swin Transformer. We hope this new baseline will be helpful to the further research and application of vision transformer. Code will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch.


“…To obtain better generalization and data-efficiency of the model, we perform data augmentation on both images and texts during the pre-training phase to construct more image-text pairs. We apply AutoAugment (Krizhevsky et al, 2012; Sato et al, 2015; Cubuk et al, 2019; Hoffer et al, 2020) for image augmentation, following the SOTA vision recognition methods (Touvron et al, 2021; Xie et al, 2020b). To ensure the augmented texts are semantically similar as the original one, for text augmentation, we rewrite the original text using back-translation (Xie et al, 2020a; Sennrich et al, 2016a).…”

Section: Image and Text Augmentation

Mentioning

confidence: 99%

FILIP: Fine-grained Interactive Language-Image Pre-Training

Yao, Huang, Hou, et al. 2021. Preprint.


Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either via the similarity of the global feature of each modality which misses sufficient information, or finer-grained interactions using cross/self-attention upon visual and textual tokens. However, cross/self-attention suffers from inferior efficiency in both training and inference. In this paper, we introduce a large-scale Fine-grained Interactive Language-Image Pre-training (FILIP) to achieve finer-level alignment through a cross-modal late interaction mechanism, which uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. FILIP successfully leverages the finer-grained expressiveness between image patches and textual words by modifying only contrastive loss, while simultaneously gaining the ability to pre-compute image and text representations offline at inference, keeping both large-scale training and inference efficient. Furthermore, we construct a new large-scale image-text pair dataset called FILIP300M for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks including zero-shot image classification and image-text retrieval. The visualization on word-patch alignment further shows that FILIP can learn meaningful fine-grained features with promising localization ability.
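The token-wise maximum similarity named in the abstract above can be illustrated with a minimal sketch. This is not the FILIP implementation: the function names, the toy 2-D token vectors, and the choice to average the per-token maxima are all assumptions made for illustration, since the abstract states only that a token-wise maximum is used.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def tokenwise_max_similarity(query_tokens, key_tokens):
    """For each query token, take its best (maximum) dot product against
    any key token, then average those maxima over the query tokens.
    The averaging step is an assumption, not stated in the abstract."""
    best = [max(dot(q, k) for k in key_tokens) for q in query_tokens]
    return sum(best) / len(best)


# Toy, hypothetical token embeddings (already unit-normalised).
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.6, 0.8], [0.0, -1.0]]

image_to_text = tokenwise_max_similarity(image_tokens, text_tokens)
text_to_image = tokenwise_max_similarity(text_tokens, image_tokens)
```

The score is asymmetric, so a contrastive objective built on it would typically need one score per direction. Because each side's tokens enter only through dot products, the token embeddings themselves can be computed ahead of time, which is consistent with the offline pre-computation the abstract describes.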


“…We follow the training recipe and augmentations from [20,22] when training from scratch for Kinetics datasets. We adopt synchronized AdamW [58] and train for 200 epochs with 2 repeated augmentation [40] on 128 GPUs. The mini-batch size is 4 clips per GPU.…”

Section: B4 Details: Kinetics Action Classification

Mentioning

confidence: 99%

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Li, Wu, Fan, et al. 2021. Preprint.


In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTs' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViT has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 56.1 AP box on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models will be made publicly available.
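The "2 repeated augmentation" setting quoted in the citation statement for this paper can be pictured with a small sketch: each sampled instance appears several times in the same mini-batch, each copy with an independently drawn augmentation. The code below is illustrative only. The `augment` function is a hypothetical stand-in (random jitter on a number rather than an image or clip transform), and the helper name and toy data are assumptions, not taken from any of the cited papers.

```python
import random


def augment(x, rng):
    """Hypothetical stand-in for a real augmentation: add small random jitter."""
    return x + rng.uniform(-0.1, 0.1)


def repeated_augmentation_batch(samples, labels, repeats, rng):
    """Build a batch holding `repeats` independently augmented copies
    of every sample; each copy keeps the label of its source sample."""
    batch_x, batch_y = [], []
    for x, y in zip(samples, labels):
        for _ in range(repeats):
            batch_x.append(augment(x, rng))
            batch_y.append(y)
    return batch_x, batch_y


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [1.0, 2.0, 3.0, 4.0]
labels = [0, 1, 0, 1]
batch_x, batch_y = repeated_augmentation_batch(samples, labels, repeats=2, rng=rng)
```

With `repeats=2`, four source samples yield a batch of eight entries, so for a fixed batch size the number of distinct instances per batch is halved.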
