Linear dynamical systems approach for human action recognition with dual-stream deep features

Du, Zhouning; Mukaidani, Hiroaki

doi:10.1007/s10489-021-02367-6

Cited by 17 publications

(8 citation statements)

References 40 publications

“…The rest of the methods, which include multi-task hierarchical clustering [57], BT-LSTM [58], deep autoencoder [59], two-stream attention LSTM [60], weighted entropy-variance based feature selection [61], dilated CNN+BiLSTM+RB [62], DS-GRU [43], and local-global features + QSVM [63], obtain accuracies of 89.7%, 85.3%, 96.2%, 96.9%, 94.5%, 89.0%, 97.1%, and 82.6%, respectively. For the UCF50 dataset, the proposed method outperforms the state-of-the-art methods with the best accuracy of 97.5%, whereas (LD-BF) + (LD-DF) [64] obtains the second-best accuracy of 96.7%. The local-global features + QSVM [63] achieves the lowest accuracy of 69.4%, whereas the rest of the methods including multi-task hierarchical clustering [57], deep autoencoder [59], ensemble model with swarm-based optimization [65], and DS-GRU [43] obtain …

Model | Year | Accuracy (%)
Multi-task hierarchical clustering [57] | 2017 | 89.7
BT-LSTM [58] | 2018 | 85.3
Deep autoencoder [59] | 2019 | 96.2
STDN [56] | 2020 | 98.2
Two-stream attention LSTM [60] | 2020 | 96.9
Weighted entropy-variances based feature selection [61] | 2021 | 94.5
Dilated CNN+BiLSTM+RB [62] | 2021 | 89.0
DS-GRU [43] | 2021 | 97.1
Local-global features + QSVM [63] | 2021 | 82.6
DA-CNN+Bi-GRU (Proposed) | 2022 | 98.0

Finally, for the HMDB51 dataset comprising challenging action videos, our proposed method achieves the best results by obtaining an accuracy of 79.3%, whereas the runner-up method is evidential deep learning [66], which attains an accuracy of 77.0%.…”

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 94%

Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework

Ullah

Munir

2022

Preprint

Vision-based human activity recognition has emerged as one of the essential research areas in the video analytics domain. Over the last decade, numerous advanced deep learning algorithms have been introduced to recognize complex human actions from video streams. These deep learning algorithms have shown impressive performance for the human activity recognition task. However, these newly introduced methods focus exclusively either on model performance or on computational efficiency and robustness, resulting in a biased tradeoff in their proposals for the challenging human activity recognition problem. To overcome the limitations of contemporary deep learning models for human activity recognition, this paper presents a computationally efficient yet generic spatial-temporal cascaded framework that exploits deep discriminative spatial and temporal features for human activity recognition. For efficient representation of human actions, we have proposed an efficient dual attentional convolutional neural network (CNN) architecture that leverages a unified channel-spatial attention mechanism to extract human-centric salient features in video frames. The dual channel-spatial attention layers together with the convolutional layers learn to be more attentive in the spatial receptive fields having objects over the number of feature maps. The extracted discriminative salient features are then forwarded to a stacked bi-directional gated recurrent unit (Bi-GRU) for long-term temporal modeling and recognition of human actions using both forward and backward pass gradient learning. Extensive experiments are conducted on three publicly available human action datasets, where the obtained results verify the effectiveness of our proposed framework over the state-of-the-art methods in terms of model accuracy and inference runtime across each dataset. Experimental results show that the proposed framework attains an improvement in execution time of up to 167× in terms of frames per second as compared to most of the contemporary action recognition methods.

“…For the UCF50 dataset, the proposed Student 3DCNN-TUTL method outperforms all the comparative methods by obtaining the best accuracy of 97.6%, followed by the runner-up LD-BF + LD-DF method [65], which achieves an accuracy of 97.5%. The local-global features + QSVM [45] method attains the lowest accuracy of 69.4% amongst all comparative methods on the UCF50 dataset.…”

Section: 3 (mentioning)

confidence: 94%

“…
Multi-task hierarchical clustering [46] | 2016 | 93.2
Deep autoencoder [48] | 2019 | 96.4
Ensembled swarm-based optimization [64] | 2021 | 92.2
DS-GRU [52] | 2021 | 95.2
Local-global features + QSVM [45] | 2021 | 69.4
LD-BF + LD-DF [65] | 2022 | 97.5
Student 3DCNN-TUTL (Ours) | 2023 | 97.6
…”

Section: Model | Year | Accuracy (%) (mentioning)

confidence: 99%

A 3DCNN-Based Knowledge Distillation Framework for Human Activity Recognition

Ullah

Munir

2023

J. Imaging

Human action recognition has been actively explored over the past two decades to further advancements in the video analytics domain. Numerous research studies have been conducted to investigate the complex sequential patterns of human actions in video streams. In this paper, we propose a knowledge distillation framework, which distills spatio-temporal knowledge from a large teacher model to a lightweight student model using an offline knowledge distillation technique. The proposed offline knowledge distillation framework takes two models: a large pre-trained 3DCNN (three-dimensional convolutional neural network) teacher model and a lightweight 3DCNN student model (i.e., the teacher model is pre-trained on the same dataset on which the student model is to be trained). During offline knowledge distillation training, the distillation algorithm trains only the student model to help the student model achieve the same level of prediction accuracy as the teacher model. To evaluate the performance of the proposed method, we conduct extensive experiments on four benchmark human action datasets. The obtained quantitative results verify the efficiency and robustness of the proposed method over the state-of-the-art human action recognition methods by obtaining up to 35% improvement in accuracy over existing methods. Furthermore, we evaluate the inference time of the proposed method and compare the obtained results with the inference time of the state-of-the-art methods. Experimental results reveal that the proposed method attains an improvement of up to 50× in terms of frames per second (FPS) over the state-of-the-art methods. The short inference time and high accuracy make our proposed framework suitable for human activity recognition in real-time applications.

“…Laboratory experiments show that a recurrent neural network trained only on simulation data, based on motion capture data and 3D avatars, achieves almost perfect results in classifying human actions on real data [10]. Du & Mukaidani discussed a two-stream human action recognition method based on a linear dynamic system, proposed a dual-stream deep feature extraction framework based on a pre-processed convolutional neural network, and verified the effectiveness of the method [11]. Sedmidubsky and Zezula proposed an evaluation procedure for 3D human action recognition to determine the best combination in a very effective way [12].…”

Section: Literature Review (mentioning)

confidence: 99%

Rural folk dance movement recognition based on an improved MCM-SVM model in wireless sensing environment

Xie

2023

Preprint

In order to improve the accuracy and timeliness of folk dance movement recognition, this paper proposes an improved MCM-SVM recognition model to recognize the lower limb human motion of ethnic dance in rural areas based on sensors. In order to recognize these actions, the SVM algorithm is used to identify the current action, and the MCM model is used to optimize the recognition result. The experimental results show that the proposed improved model achieves a higher recognition rate compared to the SVM algorithm for the recognition of different dance moves. The average recognition rate exceeds 93%, and the average recognition time is about 0.6 ms, which verifies the effectiveness of the proposed model. The proposed model will provide guidance and practicality for the design and construction of future dance movement recognition systems.
