Lightweight design and efficiency are critical for the practical application of image super-resolution (SR) algorithms. We propose ShuffleMixer, a simple and effective approach to lightweight image super-resolution that explores large kernel convolutions and channel split-and-shuffle operations. In contrast to previous SR models that simply stack multiple small-kernel convolutions or complex operators to learn representations, we explore a large-kernel ConvNet for mobile-friendly SR design. Specifically, we develop a large depth-wise convolution and two projection layers based on channel splitting and shuffling as the basic component for mixing features efficiently. Since the contexts of natural images are strongly locally correlated, using large depth-wise convolutions alone is insufficient to reconstruct fine details. To overcome this problem while maintaining the efficiency of the proposed module, we introduce Fused-MBConvs into the network to model the local connectivity of different features. Experimental results demonstrate that the proposed ShuffleMixer is about 6× smaller than state-of-the-art methods in terms of model parameters and FLOPs while achieving competitive performance. In NTIRE 2022, our primary method won the model complexity track of the Efficient Super-Resolution Challenge [23]. The code is available at https://github.com/sunny2109/MobileSR-NTIRE2022.

Recently, convolutional neural network (CNN) based SR models [8,9,1,16,25,45] have achieved impressive reconstruction performance. However, these networks extract local features hierarchically and therefore rely heavily on stacking deeper or more complex architectures to enlarge the receptive field and improve performance. As a result, the required computational budget makes these heavy SR models difficult to deploy on resource-constrained mobile devices in practical applications [44].

To reduce the burden of heavy SR models, various methods have been proposed to lower model complexity or speed up inference, including efficient operator design [32,28,36,9,16,1,33,43,23,27], neural architecture search [6,35], knowledge distillation [12,13], and structural re-parameterization [7,23,44]. These methods mainly rely on improved small spatial convolutions or advanced training strategies; large kernel convolutions are rarely explored. Moreover, they mostly optimize a single efficiency indicator and do not perform well in real resource-constrained settings. Obtaining a better trade-off among complexity, latency, and SR quality is therefore imperative.
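To make the basic mixing component concrete, the following is a minimal PyTorch sketch of a split-and-shuffle mixing layer built around a large depth-wise convolution, in the spirit of the description above. The kernel size of 7, the SiLU activation, the two-way channel split, and the exact ordering of operations are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    # Reorder channels across groups (as in ShuffleNet) so the two
    # splits exchange information between consecutive projections.
    b, c, h, w = x.size()
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)


class ShuffleMixerLayer(nn.Module):
    """Sketch of the basic mixing component: two split-and-shuffle
    point-wise projections around a large depth-wise convolution."""

    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        # Point-wise projection applied to half of the channels only;
        # the other half passes through and is mixed by the shuffle.
        self.proj1 = nn.Sequential(nn.Conv2d(dim // 2, dim // 2, 1), nn.SiLU())
        self.proj2 = nn.Sequential(nn.Conv2d(dim // 2, dim // 2, 1), nn.SiLU())
        # Large depth-wise convolution for spatial mixing: per-channel
        # filtering keeps parameters and FLOPs low despite the big kernel.
        self.spatial = nn.Conv2d(dim, dim, kernel_size,
                                 padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)               # channel split
        x = torch.cat((self.proj1(a), b), dim=1)
        x = channel_shuffle(x)                 # mix the two halves
        x = self.spatial(x)                    # large depth-wise conv
        a, b = x.chunk(2, dim=1)
        x = torch.cat((self.proj2(a), b), dim=1)
        return channel_shuffle(x)
```

Because the point-wise projections touch only half of the channels and the spatial convolution is depth-wise, the layer's cost grows roughly linearly with the channel dimension, which is what makes a large kernel affordable in a mobile-friendly design.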
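The Fused-MBConv used to restore local connectivity can be sketched in the same style. The block below follows the EfficientNetV2-style formulation (a regular 3×3 expansion convolution followed by a 1×1 projection and a residual connection); the expansion ratio and activation are assumptions for illustration, not the paper's stated settings.

```python
import torch
import torch.nn as nn


class FusedMBConv(nn.Module):
    """Sketch of a Fused-MBConv block: a regular 3x3 convolution expands
    the channels, a 1x1 convolution projects them back, and a residual
    connection preserves the input."""

    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Sequential(
            nn.Conv2d(dim, hidden, 3, padding=1), nn.SiLU())
        self.project = nn.Conv2d(hidden, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The small 3x3 kernel re-introduces the strong local
        # correlations that a purely large-kernel depth-wise design
        # can miss when reconstructing fine details.
        return x + self.project(self.expand(x))
```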