Compared with ordinary image classification tasks, fine-grained image classification is closer to real-life scenes. Its key challenge is to find local regions with sufficient discriminative power and to learn effective features from them. Based on the bilinear convolutional neural network (B-CNN), this paper designs a local importance representation convolutional neural network (LIR-CNN) model, which can be divided into three parts. First, a super-pixel segmentation convolution method is used for the input layer of the model, which allows the model to receive images of different sizes and fully accounts for their complex geometric deformations. Then, we replace the standard convolution of B-CNN with the proposed local importance representation convolution, which learns to score each local region of the image so as to distinguish the regions' importance. Finally, channelwise convolution is proposed, which plays an important role in balancing network lightness against classification accuracy. Experimental results on benchmark datasets (CUB-200-2011, FGVC-Aircraft, and Stanford Cars) show that the LIR-CNN model performs well on fine-grained image classification tasks.
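The abstract does not give the exact form of the local importance representation convolution, but the B-CNN backbone it builds on is standard. Below is a minimal PyTorch sketch of B-CNN bilinear pooling plus a hypothetical per-location importance branch; the `LocalImportance` module (a 1x1 scoring conv with sigmoid gating) is an illustrative assumption, not the paper's verified design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalImportance(nn.Module):
    """Hypothetical local-importance scoring: a 1x1 conv learns a
    per-location score map that reweights the feature map."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                     # x: (B, C, H, W)
        w = torch.sigmoid(self.score(x))      # (B, 1, H, W) importance map
        return x * w                          # reweight local regions

def bilinear_pool(a, b):
    """Standard B-CNN bilinear pooling with signed sqrt + L2 norm."""
    B, C1, H, W = a.shape
    C2 = b.shape[1]
    a = a.reshape(B, C1, H * W)
    b = b.reshape(B, C2, H * W)
    x = torch.bmm(a, b.transpose(1, 2)) / (H * W)   # (B, C1, C2) outer product
    x = x.reshape(B, -1)
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-8)  # signed square root
    return F.normalize(x, dim=1)                          # L2 normalization
```

A fine-grained head could then apply `bilinear_pool` to two (possibly identical) feature streams after importance reweighting; the signed square root and L2 normalization follow the original B-CNN recipe.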
Computer vision systems need to be insensitive to the scale of objects in natural scenes, so it is important to study multi-scale representations of features. Res2Net implements hierarchical multi-scale convolution inside residual blocks, but its arbitrary grouping of channels limits the robustness and interpretability of the network. We propose a new multi-scale convolution model based on multiple forms of attention, which introduces attention mechanisms into the structure of a Res2-block to better guide feature expression. First, we adopt channel attention to score the channels and sort them in descending order of feature importance (Channels-Sort). The sorted channels are grouped and hierarchically convolved within the block to form a single-attention multi-scale block (AMS-block). Next, we apply channel attention to the small residual groups as well, constituting a dual-attention multi-scale block (DAMS-block). Finally, introducing spatial attention before the channel sorting forms a multi-attention multi-scale block (MAMS-block). A MAMS convolutional neural network (CNN) is a series of MAMS-blocks; it allows significant information to be expressed at more levels and can easily be grafted onto different convolutional structures. Limited by hardware, we validate the proposed ideas only on convolutional networks of the same magnitude. The experimental results show that a convolution model with attention mechanisms and multi-scale features is superior for image classification.

… bottom-up models [16-18] and task-driven top-down models [19-21]. Research on combining deep learning with attention mechanisms has received considerable attention [22-24]. The goal of such work is to select the information in the input that is most relevant to the current task. In deep neural networks, researchers often use masks to realize attention: key features in the data are demarcated by training additional layers of weights. This approach naturally embeds the attention mechanism into the deep network structure and participates in end-to-end training. Attention models are well suited to computer vision tasks such as image classification, saliency analysis, and object detection. The same object shows different shapes in different natural scenes, and when a computer vision system senses an unfamiliar scene, it cannot predict the scale of the objects in the image in advance; it is therefore necessary to observe image information at different scales. The multi-scale representation of images can be divided into two types: multi-scale space and multi-resolution pyramid. The difference between them is that multi-scale space keeps the same resolution at diverse scales. In visual tasks, the multi-resolution pyramid approach to processing targets can be separated into two categories: image pyramid and feature pyramid. The image pyramid works best, but its time and space complexity is high, so the feature pyramid is widely used…
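As a concrete illustration of the Channels-Sort idea described in the abstract above, here is a minimal PyTorch sketch of an AMS-style block: an SE-style branch scores the channels, the channels are sorted by score, and the sorted groups are hierarchically convolved as in Res2Net. The class name, reduction ratio, group count, and 3x3 kernels are assumptions; the paper's exact block may differ:

```python
import torch
import torch.nn as nn

class ChannelsSortMSBlock(nn.Module):
    """Sketch of an AMS-style block: SE-style channel attention scores
    the channels, channels are sorted by score (Channels-Sort), then
    split into groups that are hierarchically convolved as in Res2Net."""
    def __init__(self, channels, groups=4, reduction=16):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.fc = nn.Sequential(                      # SE-style scoring branch
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        w = channels // groups
        self.convs = nn.ModuleList(
            [nn.Conv2d(w, w, 3, padding=1) for _ in range(groups - 1)])

    def forward(self, x):                             # x: (B, C, H, W)
        B, C, H, W = x.shape
        s = self.fc(x.mean(dim=(2, 3)))               # channel scores (B, C)
        order = s.argsort(dim=1, descending=True)     # descending importance
        idx = order[:, :, None, None].expand(-1, -1, H, W)
        x = torch.gather(x * s[:, :, None, None], 1, idx)  # weight + sort
        splits = torch.chunk(x, self.groups, dim=1)
        out, prev = [splits[0]], splits[0]
        for i, conv in enumerate(self.convs):         # Res2-style hierarchy
            prev = conv(splits[i + 1] + prev)
            out.append(prev)
        return torch.cat(out, dim=1)
```

Under this reading, the DAMS-block would additionally apply channel attention within each small group, and the MAMS-block would add spatial attention before the sort.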
The focus of fine-grained image classification is to ignore interfering information and grasp local features, a challenge that the visual attention mechanism excels at. First, we construct a two-level attention convolutional network that characterizes object-level attention and pixel-level attention. Then, we combine the two kinds of attention through a second-order response transform algorithm. Furthermore, we propose a clustering-based grouping attention model, which realizes part-level attention. The grouping attention method stretches all the semantic features in a deeper convolutional layer of the network into vectors; these vectors are clustered by vector dot product, and each cluster represents a particular semantic. The grouping attention algorithm implements the functions of group convolution and feature clustering, which greatly reduces the network parameters and improves the recognition rate and interpretability of the network. Finally, low-level visual features and high-level semantic information are merged by a multi-level feature fusion method to accurately classify fine-grained images. We achieve good results without using pre-trained networks or fine-tuning techniques.
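The dot-product clustering step could be sketched as follows. This is a minimal PyTorch illustration assuming spherical k-means over flattened channel response maps; the function name, cluster count, and iteration scheme are assumptions, as the abstract does not specify the actual clustering procedure:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def group_channels_by_similarity(feat, k=4, iters=5):
    """Sketch of dot-product clustering of channel feature vectors
    (spherical k-means). Each conv channel's H*W response map is
    flattened to a vector; channels whose vectors point the same way
    are grouped together as one semantic part."""
    B, C, H, W = feat.shape
    v = F.normalize(feat.reshape(B, C, H * W), dim=2)   # unit channel vectors
    centers = v[:, torch.randperm(C)[:k], :]            # init from k channels
    for _ in range(iters):
        sim = torch.einsum('bcd,bkd->bck', v, centers)  # dot-product similarity
        assign = sim.argmax(dim=2)                      # (B, C) cluster labels
        for j in range(k):                              # recompute centers
            mask = (assign == j).unsqueeze(2).float()
            centers[:, j] = F.normalize((v * mask).sum(dim=1), dim=1)
    return assign                                       # channel -> semantic id
```

The resulting assignment could then drive the grouping for group convolution, which is how the abstract ties feature clustering to parameter reduction.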
With the continuous evolution of research on convolutional neural networks, introducing attention mechanisms into the convolutional structure has become an efficient and popular approach. The channel attention designed in SENet has contributed greatly to the spread of attention-based convolution models. However, our research found that SENet focuses on whole feature channels rather than on the objects within them: it enhances or weakens the target objects and the background information in a channel simultaneously. Building on the channel attention convolutional network, we first perform channel sorting and group convolution on the feature map, expanding each group to β times its original channel count during the group convolution, to construct a channel expansion convolutional network (CENet), where β is an array of channel expansion coefficients. CENet captures attention to objects within the feature channels while expanding the proportion of features in the relatively important channels. Furthermore, we improve the structure of CENet and merge it into an intra-layer multi-scale convolutional model to construct an object-level attention multi-scale convolutional neural network (OAMS-CNN). We conducted extensive experiments on four datasets: CIFAR-10, CIFAR-100, FGVC-Aircraft, and Stanford Cars. The experimental results show that the proposed object-level attention convolution model achieves good image classification results.
Index Terms: Channel expansion network, object-level attention CNN, multi-scale CNN, image classification.
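A rough PyTorch sketch of the sort-then-expand idea follows. The SE-style scoring branch, the class name, and the example β = (4, 2, 1, 1) are assumptions for illustration; the abstract only states that β is an array of per-group expansion coefficients:

```python
import torch
import torch.nn as nn

class ChannelExpansionConv(nn.Module):
    """Sketch of a CENet-style layer: channels are scored and sorted
    (SE-style), split into groups, and each group's conv widens it by
    an expansion coefficient from beta, so higher-ranked channels
    receive a larger share of the output features."""
    def __init__(self, channels, beta=(4, 2, 1, 1), reduction=16):
        super().__init__()
        g = len(beta)
        assert channels % g == 0
        w = channels // g
        self.g = g
        self.fc = nn.Sequential(                      # SE-style scoring branch
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.convs = nn.ModuleList(                   # group i expands w -> w*beta[i]
            [nn.Conv2d(w, w * b, 3, padding=1) for b in beta])

    def forward(self, x):                             # x: (B, C, H, W)
        B, C, H, W = x.shape
        s = self.fc(x.mean(dim=(2, 3)))               # channel importance (B, C)
        order = s.argsort(dim=1, descending=True)
        idx = order[:, :, None, None].expand(-1, -1, H, W)
        x = torch.gather(x, 1, idx)                   # sort channels by importance
        groups = torch.chunk(x, self.g, dim=1)
        return torch.cat([c(g) for c, g in zip(self.convs, groups)], dim=1)
```

With β descending, the groups holding the highest-scored channels are widened the most, matching the abstract's claim that relatively important channels come to occupy a larger proportion of the features.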