2023, DOI: 10.1109/tip.2022.3228156
Pose-Appearance Relational Modeling for Video Action Recognition

Cited by 12 publications (2 citation statements)
References 56 publications
“…The average accuracy on the KTH dataset is 92.54%, and the CNN structure has only 5 layers, whereas the deeper layers of our model can better extract the temporal information of continuous video sequences. Reference [20] used a pose-appearance relational network (PARNet) that identifies 14 skeletal key points of the human body, together with a temporal-attention-based LSTM model (TA-LSTM) for action recognition that captures long-term contextual information in action videos to improve the robustness of the network. A Spatial Appearance (SA) module was additionally used to improve aggregation between adjacent frames, reaching an accuracy of 94.10% on the dataset.…”
Section: Testing on the KTH Public Data Set
confidence: 99%
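The temporal-attention aggregation this statement describes — scoring each frame, normalizing the scores with a softmax, and pooling frame features into one clip-level descriptor — can be sketched as below. This is an illustrative sketch only, not the authors' TA-LSTM: the function name, the dot-product scoring against a `query` vector, and the plain-Python representation of features are all assumptions for clarity.

```python
import math

def temporal_attention_pool(frame_features, query):
    """Weight each frame's feature vector by a softmax attention score
    and return the attention-pooled clip-level feature.

    frame_features: list of per-frame feature vectors (lists of floats),
                    e.g. LSTM hidden states, one per time step.
    query: a scoring vector (stand-in for a learned attention parameter).
    """
    # Score each frame by a dot product with the query vector.
    scores = [sum(q * x for q, x in zip(query, feat)) for feat in frame_features]
    # Softmax over time (numerically stabilized by subtracting the max).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Attention-weighted sum of frame features -> one video-level descriptor.
    dim = len(frame_features[0])
    pooled = [sum(a * feat[d] for a, feat in zip(alphas, frame_features))
              for d in range(dim)]
    return pooled, alphas
```

Frames whose features align with the query receive larger weights, so informative time steps dominate the pooled descriptor — the same intuition behind using temporal attention to capture long-term context across a video.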
“…Action recognition is a widely studied problem in computer vision, and numerous approaches have been proposed to tackle it. Traditional approaches [16,17] were mainly based on hand-crafted features, but the recent success of deep learning has led to a shift toward end-to-end learning methods [18,19]. Among them, convolutional neural networks (CNNs) have been widely adopted for their ability to effectively extract spatial and temporal features from videos.…”
Section: Action Recognition
confidence: 99%