2022
DOI: 10.48550/arxiv.2207.03620
Preprint

More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity

Abstract: Transformers have quickly shined in the computer vision world since the emergence of Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) seems to be challenged by increasingly effective transformer-based models. Very recently, a couple of advanced convolutional models strike back with large kernels motivated by the local but large attention mechanism, showing appealing performance and efficiency. While one of them, i.e. RepLKNet, impressively manages to scale the kernel size t…

Cited by 17 publications (30 citation statements)
References 63 publications
“…Inspired by the success of vision transformers, researchers have challenged the traditional small-kernel design of CNNs [22,52] and suggested the use of large convolution kernels for visual tasks [11,17,18,38,40,46,73]. For example, ConvNeXt [40] suggests directly adopting a 7×7 depth-wise convolution, while the Visual Attention Network (VAN) [18] uses a kernel size of 21 × 21 and introduces an attention mechanism.…”
Section: Large Kernel Design in CNNs
confidence: 99%
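Both designs mentioned in this excerpt, ConvNeXt's 7×7 depth-wise convolution [40] and VAN's 21×21 kernel [18], amount to a depth-wise convolution with a wide spatial extent. The PyTorch sketch below illustrates such a block with a configurable kernel size; the module name, layer composition, and residual wiring are assumptions made for illustration, not code from either paper.

```python
# Minimal sketch of a large-kernel depth-wise convolution block (illustrative,
# not the ConvNeXt or VAN reference implementation). kernel_size=7 mirrors
# ConvNeXt's 7x7 depth-wise conv; kernel_size=21 mirrors VAN's 21x21 kernel.
import torch
import torch.nn as nn

class LargeKernelDWBlock(nn.Module):  # hypothetical module name
    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        # groups=dim makes the convolution depth-wise; padding keeps spatial size.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pwconv = nn.Conv2d(dim, dim, kernel_size=1)  # pointwise channel mixing
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pwconv(self.act(self.norm(self.dwconv(x))))

# Usage: a 21x21 depth-wise kernel over a 64-channel feature map.
x = torch.randn(1, 64, 56, 56)
y = LargeKernelDWBlock(64, kernel_size=21)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```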
“…Vision transformers (Dosovitskiy et al, 2020) are another popular approach to overcoming the local receptive field of convolutions with small kernel sizes, querying information across distributed image patches. However, in practice the sophisticated architecture is often unnecessary for many computer-vision tasks (Pinto et al, 2022): while simple small-kernel U-Nets generally perform well, as their multi-resolution convolutions effectively widen the receptive field (Liu et al, 2022b), increasing the kernel size can boost the performance of convolutional networks beyond that achieved by vision transformers across multiple tasks (Liu et al, 2022a).…”
Section: Architectures
confidence: 99%
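The claim that multi-resolution convolutions effectively widen the receptive field can be made concrete with a back-of-the-envelope calculation. The helper below is a generic sketch, not code from any cited work, and the U-Net-style encoder configuration is an assumed example.

```python
# Receptive-field arithmetic for a chain of convolutions/downsamplings
# (standard formula: rf += (k - 1) * jump; jump *= stride). Illustrative only.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) applied in order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Assumed U-Net-style encoder: three stages of two 3x3 convs, each followed
# by 2x downsampling. Stacking small kernels across resolutions grows the
# receptive field quickly.
unet_encoder = [(3, 1), (3, 1), (2, 2)] * 3
print(receptive_field(unet_encoder))  # 36 input pixels after three stages

# A single 31x31 convolution reaches a 31-pixel receptive field in one layer.
print(receptive_field([(31, 1)]))     # 31
```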
“…Recent works [38,68,70] resort to the existing network structures, e.g. ResNet [22], or MobileNet V2 [55], with several modifications, such as group convolution [12], inverted bottleneck [55], or large kernel [17,36], demonstrating competitive performance on par with Transformers on a similar model scale. Particularly, RepLKNet [17] and SLaK [36] build pure CNN models with a focus on increasing ERF using kernel sizes as large as 31 × 31 and 51 × 51, respectively.…”
Section: Introduction
confidence: 99%
“…ResNet [22], or MobileNet V2 [55], with several modifications, such as group convolution [12], inverted bottleneck [55], or large kernel [17,36], demonstrating competitive performance on par with Transformers on a similar model scale. Particularly, RepLKNet [17] and SLaK [36] build pure CNN models with a focus on increasing ERF using kernel sizes as large as 31 × 31 and 51 × 51, respectively. While they achieve comparable performance to the Transformer, such explorations of large kernel CNNs are limited to the image classification task.…”
Section: Introduction
confidence: 99%
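The 51 × 51 figure quoted above, together with the indexed paper's title ("Scaling up Kernels Beyond 51x51 using Sparsity"), points to SLaK [36] not instantiating a dense 51×51 kernel directly. The sketch below shows one way such a kernel can be decomposed into two rectangular depth-wise branches plus a small square kernel; the module name and exact branch composition are assumptions for illustration, and SLaK's dynamic-sparsity training is omitted entirely, so this is not the paper's reference implementation.

```python
# Hedged sketch of a decomposed large-kernel depth-wise convolution in the
# spirit of SLaK [36]: two parallel rectangular depth-wise branches (m x n and
# n x m) approximate a dense m x m kernel at far lower parameter cost, plus a
# small square kernel for local detail. Dynamic sparsity is not modeled here.
import torch
import torch.nn as nn

class DecomposedLargeKernel(nn.Module):  # hypothetical module name
    def __init__(self, dim: int, m: int = 51, n: int = 5):
        super().__init__()
        self.horiz = nn.Conv2d(dim, dim, (m, n), padding=(m // 2, n // 2), groups=dim)
        self.vert = nn.Conv2d(dim, dim, (n, m), padding=(n // 2, m // 2), groups=dim)
        self.small = nn.Conv2d(dim, dim, n, padding=n // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing the branches approximates a dense m x m depth-wise kernel.
        return self.horiz(x) + self.vert(x) + self.small(x)

# Parameter count per channel: 51*51 = 2601 weights for a dense kernel versus
# 51*5 + 5*51 + 5*5 = 535 for the decomposed form.
x = torch.randn(1, 32, 64, 64)
print(DecomposedLargeKernel(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```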