2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01166

Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs

Cited by 636 publications (234 citation statements). References 50 publications.
“…In computer vision, a line of works [Ding et al., 2019; Guo et al., 2020; Ding et al., 2021; Cao et al., 2022] explored using structural re-parameterization to create 2D convolution kernels. However, most of these works are limited to the vision domain and utilize only short-range convolution kernels (e.g., 7 × 7), with only one exception [Ding et al., 2022], which scales the convolution kernel size to 31 × 31 with an optimized CUDA kernel. Our SGConv kernel is a special parameterization of global convolution kernels that tackles LRD and showcases the extensibility of re-parameterized kernels.…”
Section: Related Work (mentioning); confidence: 99%
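The re-parameterization idea referenced above can be made concrete with a small sketch (my own illustration, not code from any of the cited papers): a parallel small-kernel branch is equivalent to a single large kernel whose weights are the large kernel plus the zero-padded small kernel, so the extra branch can be merged away at inference time.

# Toy equivalence check (illustrative only): a depthwise 31x31 branch plus a
# parallel depthwise 3x3 branch equals one depthwise 31x31 conv whose kernel
# is the sum of the large kernel and the zero-padded small kernel.
import torch
import torch.nn.functional as F

def merge_branches(w_large, w_small):
    # Zero-pad the small kernel to the large kernel's spatial size and add.
    pad = (w_large.shape[-1] - w_small.shape[-1]) // 2
    return w_large + F.pad(w_small, [pad] * 4)

x = torch.randn(1, 8, 56, 56)
w31 = torch.randn(8, 1, 31, 31)  # depthwise large kernel (groups = channels)
w3 = torch.randn(8, 1, 3, 3)     # parallel small kernel
y_two = (F.conv2d(x, w31, padding=15, groups=8)
         + F.conv2d(x, w3, padding=1, groups=8))
y_one = F.conv2d(x, merge_branches(w31, w3), padding=15, groups=8)
print(torch.allclose(y_two, y_one, atol=1e-4))  # True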
“…where Parallel_{3×3,5×5,7×7} contains multiple branches of 3 × 3, 5 × 5, and 7 × 7 convolution layers. Following (Ding et al., 2022; Guo et al., 2022), we apply dilated depthwise convolutions with kernel sizes 5 × 5 and 7 × 7 and dilation rates of 2 and 3 to obtain a larger receptive field.…”
Section: MCA (mentioning); confidence: 99%
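A minimal sketch of the multi-branch pattern quoted above, assuming PyTorch depthwise convolutions with same-style padding; the class name and exact branch set are illustrative, not taken from the cited paper:

import torch
import torch.nn as nn

class ParallelDWBranches(nn.Module):
    """Hypothetical module: parallel depthwise convolutions, including dilated
    5x5 (dilation 2) and 7x7 (dilation 3) branches, summed elementwise."""
    def __init__(self, channels):
        super().__init__()
        specs = [(3, 1), (5, 1), (7, 1), (5, 2), (7, 3)]  # (kernel, dilation)
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=d * (k - 1) // 2,
                      dilation=d, groups=channels)
            for k, d in specs
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

x = torch.randn(2, 64, 32, 32)
print(ParallelDWBranches(64)(x).shape)  # torch.Size([2, 64, 32, 32])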
“…ConvNeXt built a pure CNN family based on ResNet (He et al., 2016a), which performs on par with or slightly better than ViT by borrowing its training procedure and macro/micro-level architecture designs. RepLKNet (Ding et al., 2022) follows the large-kernel design and proposes to learn long-range relations by adopting kernel sizes as large as 31 × 31 to enlarge effective receptive fields. Although encouraging performance has been achieved by the above methods, their computation costs are relatively large.…”
Section: Introduction (mentioning); confidence: 99%
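As a rough illustration of why a 31 × 31 kernel is affordable in this setting (my own back-of-the-envelope check, not RepLKNet code): the large kernel is applied depthwise, so the parameter count scales with C·K² rather than C²·K².

import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

C, K = 256, 31
dense = nn.Conv2d(C, C, K, padding=K // 2, bias=False)                # C*C*K*K weights
depthwise = nn.Conv2d(C, C, K, padding=K // 2, groups=C, bias=False)  # C*K*K weights
print(n_params(dense), n_params(depthwise))  # 62980096 vs 246016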
“…Thus, information can be gathered from a large region. Inspired by this characteristic of Transformers, a series of works has been proposed to design better CNNs [48, 40, 12, 16]. ConvMixer [48] utilizes large-kernel convolutions to build the model and achieves performance competitive with ViT [15].…”
Section: Related Work (mentioning); confidence: 99%
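For reference, a hedged sketch of a ConvMixer-style block as commonly described (a paraphrase, not the authors' code): a large-kernel depthwise convolution with a residual connection mixes spatial locations, and a 1 × 1 convolution mixes channels, each followed by GELU and BatchNorm.

import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        # Spatial mixing: large-kernel depthwise conv inside a residual branch.
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )
        # Channel mixing: pointwise (1x1) convolution.
        self.channel = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = x + self.spatial(x)
        return self.channel(x)

print(ConvMixerBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)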