The world's elderly population continues to grow at an unprecedented rate, creating a need to monitor the safety of aging individuals. One of the current problems is accurately classifying elderly physical activities, especially falling down, and delivering prompt assistance to someone in need. With the advent of deep learning, there has been extensive research on vision-based action recognition, and fall detection in particular, using 2D human pose estimation. Nevertheless, due to the lack of large-scale elderly fall datasets and the persistence of numerous challenges such as varying camera angles, illumination, and occlusion, accurately classifying falls has remained problematic. To address these problems, this research first carried out a comprehensive study of the AI Hub dataset, collected from real-life recordings of elderly people, in order to benchmark the performance of state-of-the-art human pose estimation methods. Second, owing to the limited number of real datasets, augmentation with synthetic data was applied and the resulting performance improvement was validated through changes in accuracy. Third, this study shows that a Transformer network applied to elderly action recognition outperforms LSTM-based networks by a noticeable margin. Lastly, by observing the quantitative and qualitative performance of different networks, this paper proposes an efficient solution for elderly activity recognition and fall detection in the context of surveillance cameras.
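To illustrate the kind of Transformer-over-pose classifier compared against LSTM baselines above, the sketch below is a minimal, assumption-driven PyTorch example rather than the paper's model: the joint count, clip length, class set, and all hyperparameters are placeholders chosen only to show how a Transformer encoder attends over frames of 2D keypoints.

# Minimal sketch (not the authors' code): a Transformer encoder over 2D pose
# sequences for elderly action / fall classification. Keypoint count, sequence
# length, number of classes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class PoseTransformerClassifier(nn.Module):
    def __init__(self, num_joints=17, seq_len=60, d_model=128,
                 nhead=4, num_layers=4, num_classes=5):
        super().__init__()
        # Each frame is a flattened (x, y) vector of all joints.
        self.embed = nn.Linear(num_joints * 2, d_model)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, poses):            # poses: (batch, seq_len, num_joints, 2)
        b, t, j, c = poses.shape
        x = self.embed(poses.reshape(b, t, j * c)) + self.pos[:, :t]
        x = self.encoder(x)              # temporal self-attention over frames
        return self.head(x.mean(dim=1))  # average-pool frames, then classify

if __name__ == "__main__":
    model = PoseTransformerClassifier()
    dummy = torch.randn(2, 60, 17, 2)    # two clips of 60 frames, 17 joints
    print(model(dummy).shape)            # -> torch.Size([2, 5])

The same encoder could be swapped for an LSTM over the embedded frames to reproduce the kind of baseline comparison the abstract describes.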
Combining skeleton and RGB modalities in human action recognition (HAR) has garnered attention because the two modalities complement each other. However, previous studies did not address the challenge of recognizing fine-grained human-object interaction (HOI). To tackle this problem, this study introduces a new transformer-based architecture called the Sequential Skeleton RGB Transformer (SSRT), which fuses skeleton and RGB modalities. First, SSRT leverages the strengths of Long Short-Term Memory (LSTM) and a multi-head attention mechanism to extract high-level features from both modalities. Subsequently, SSRT employs a two-stage fusion method, comprising transformer cross-attention fusion and softmax-layer late score fusion, to effectively integrate the multimodal features. Aside from evaluating the proposed method on fine-grained HOI recognition, this study also assesses its performance on two other action recognition tasks: general HAR and cross-dataset HAR. Furthermore, this study compares HAR models using single-modality features (RGB and skeleton) with the multimodal model on all three action recognition tasks. To ensure a fair comparison, comparable state-of-the-art transformer architectures are employed for both the single-modality HAR models and SSRT. In terms of modality, SSRT outperforms the best-performing single-modality HAR model on all three tasks, with accuracy improved by 9.92% on fine-grained HOI recognition, 6.73% on general HAR, and 11.08% on cross-dataset HAR. Additionally, the proposed fusion model surpasses state-of-the-art multimodal fusion techniques such as Transformer Early Concatenation, with accuracy improved by 6.32% on fine-grained HOI recognition, 4.04% on general HAR, and 6.56% on cross-dataset HAR.
INDEX TERMS: Multimodality fusion, human action recognition, fine-grained actions, transformer cross-attention fusion.
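As a rough illustration of the two-stage fusion described above (cross-attention between modalities followed by late softmax score fusion), the following PyTorch sketch is offered. It is not the published SSRT implementation; the feature dimensions, token counts, per-modality heads, and mixing weight alpha are hypothetical choices for demonstration only.

# Minimal sketch (an assumption, not the SSRT code): cross-attention fusion of
# skeleton and RGB sequence features followed by late softmax score fusion.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_classes=60):
        super().__init__()
        # Queries from one modality attend to keys/values from the other,
        # and vice versa.
        self.skel_to_rgb = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.rgb_to_skel = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.fused_head = nn.Linear(2 * d_model, num_classes)
        self.skel_head = nn.Linear(d_model, num_classes)
        self.rgb_head = nn.Linear(d_model, num_classes)

    def forward(self, skel_feats, rgb_feats, alpha=0.5):
        # skel_feats, rgb_feats: (batch, tokens, d_model) high-level features
        # already extracted by modality-specific encoders (e.g. LSTM + attention).
        s2r, _ = self.skel_to_rgb(skel_feats, rgb_feats, rgb_feats)
        r2s, _ = self.rgb_to_skel(rgb_feats, skel_feats, skel_feats)
        fused = torch.cat([s2r.mean(1), r2s.mean(1)], dim=-1)
        fused_scores = torch.softmax(self.fused_head(fused), dim=-1)
        # Late score fusion: average the per-modality softmax scores and blend
        # them with the cross-attended prediction.
        late = 0.5 * (torch.softmax(self.skel_head(skel_feats.mean(1)), dim=-1)
                      + torch.softmax(self.rgb_head(rgb_feats.mean(1)), dim=-1))
        return alpha * fused_scores + (1 - alpha) * late

In this sketch, alpha controls how much the final decision relies on the cross-attended joint representation versus the per-modality scores; the paper's actual fusion weights and heads may differ.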