Skeleton-based action recognition has been studied intensively. However, dynamic 3D skeleton data are difficult to use in practical applications because of restrictive data acquisition conditions. Action recognition based on 2D pose information extracted from RGB video effectively avoids the influence of complex backgrounds, but it is susceptible to factors such as video jitter and joint overlap. To reduce the interference of these factors, we use two-dimensional skeletal joint coordinates to represent changes in human body posture. First, we use an object detector and a pose estimation algorithm to obtain the joint coordinates in each frame of the RGB video. Then, a feature extraction network performs multi-level feature learning to establish the correspondence between actions and their multi-level features. Finally, we introduce a hierarchical attention mechanism and design a model named CHAN, which redistributes the weights used for action classification by calculating the associations between elements. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method.
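The abstract does not specify how CHAN computes the associations between elements. As a rough, hypothetical illustration of the general idea (not the authors' model), the sketch below applies scaled dot-product self-attention at two levels: over the joints in each frame to form a frame vector, then over the frame vectors to form a clip-level vector. Using raw (x, y) joint coordinates as features, mean pooling after each attention step, and the helper names `attend_and_pool` and `hierarchical_attention` are all assumptions made for this sketch; a real model would add learned projections and a classifier on top.

```python
# Hypothetical sketch of two-level (hierarchical) attention over 2D joints.
# Not the CHAN architecture: no learned weights, raw (x, y) coordinates as features.
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend_and_pool(elements: List[Vector]) -> Vector:
    """Scaled dot-product self-attention over a set of elements, then mean pooling.

    Each element's output is a weighted sum of all elements, where the weights
    come from its pairwise association (dot product) with every element.
    """
    d = len(elements[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for query in elements:
        weights = softmax([dot(query, key) * scale for key in elements])
        attended.append([sum(w * e[i] for w, e in zip(weights, elements)) for i in range(d)])
    # Mean-pool the re-weighted elements into a single summary vector.
    return [sum(v[i] for v in attended) / len(attended) for i in range(d)]


def hierarchical_attention(clip: List[List[Vector]]) -> Vector:
    """Two levels: joints -> frame vector, then frames -> clip vector.

    clip[t][j] is the (x, y) coordinate of joint j in frame t.
    """
    frame_vectors = [attend_and_pool(frame) for frame in clip]  # joint level
    return attend_and_pool(frame_vectors)  # frame level
```

For example, calling `hierarchical_attention(clip)` on a clip of 3 frames with 4 joints each returns a single 2-element vector. Because every attention output is a convex combination of its inputs, each output coordinate stays within the range of the input coordinates.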
Image similarity learning aims to exploit the correlation between different images by learning appropriate common image features. Previous CNN-based methods directly learn the similarity between image features, which effectively improves the efficiency of image similarity learning. However, these methods have two limitations: (1) the extracted image features are too homogeneous to describe image content accurately, and (2) network training is limited by the size of the dataset. Data augmentation and multi-feature fusion have been shown to improve model generalization in various vision tasks. This paper integrates both techniques into the network structure to design three multi-feature fusion networks. For training, the networks adopt a multi-input method that feeds the same data to multiple inputs to realize data augmentation, and fusing different features greatly increases the diversity of the extracted image features. The networks are then trained on datasets of different sizes to verify their training adaptability. We also study the influence of the loss function and the optimization algorithm on the learning efficiency of these complex networks. Experimental results on the self-collected XPU dataset and the Totally-Looks-Like (TLL) dataset show that the proposed method performs excellently and that data augmentation and multi-feature fusion significantly improve the learning and generalization ability of the multi-feature fusion networks. The proposed multi-feature fusion networks show strong adaptability in training.
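The abstract does not state how the different features are fused. A minimal sketch, assuming each branch (for instance, a different CNN backbone) has already produced a feature vector, is to L2-normalize each branch's vector, concatenate the results, and compare fused descriptors with cosine similarity. The functions `l2_normalize`, `fuse` and `cosine_similarity` are hypothetical names, and concatenation is only one possible fusion operator; it is not necessarily the one used in the three networks described above.

```python
# Hypothetical sketch: multi-feature fusion by normalized concatenation,
# followed by cosine similarity between fused descriptors.
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    # Leave an all-zero vector unchanged rather than dividing by zero.
    return list(v) if norm == 0 else [x / norm for x in v]


def fuse(features: Sequence[Sequence[float]]) -> Vector:
    """Concatenate per-branch feature vectors after normalizing each branch.

    Normalizing first keeps a branch with large raw magnitudes from
    dominating the fused descriptor (an assumption, not the paper's method).
    """
    fused: Vector = []
    for f in features:
        fused.extend(l2_normalize(f))
    return fused


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0  # treat similarity with an all-zero descriptor as zero
    return sum(x * y for x, y in zip(a, b)) / (na * nb)
```

Because each branch is normalized before concatenation, rescaling one branch's raw features (for example, multiplying them by 10) leaves the fused descriptor, and hence the similarity score, unchanged.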