2023, DOI: 10.1109/tip.2022.3228156
Pose-Appearance Relational Modeling for Video Action Recognition

Cited by 12 publications (2 citation statements)
References 56 publications
“…The average accuracy on the KTH dataset is 92.54%, and the CNN structure has only 5 layers, whereas the deeper layers of our model can better extract the temporal information of continuous video sequences. Reference [20] used a pose-appearance relational network (PARNet) that identifies 14 skeletal key points of the human body, together with a temporal-attention-based LSTM model (TA-LSTM) for action recognition that captures long-term contextual information in action videos to improve the robustness of the network. A Spatial Appearance (SA) module was additionally used to improve aggregation between adjacent frames, reaching an accuracy of 94.10% on the dataset.…”
Section: Testing on the KTH Public Data Set
confidence: 99%
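The temporal-attention aggregation this statement describes — scoring each frame, normalizing the scores with a softmax, and pooling frame features into one clip-level descriptor — can be sketched as below. This is an illustrative sketch only, not the authors' TA-LSTM: the function name, the dot-product scoring against a `query` vector, and the plain-Python representation of features are all assumptions for clarity.

```python
import math

def temporal_attention_pool(frame_features, query):
    """Weight each frame's feature vector by a softmax attention score
    and return the attention-pooled clip-level feature.

    frame_features: list of per-frame feature vectors (lists of floats),
                    e.g. LSTM hidden states, one per time step.
    query: a scoring vector (stand-in for a learned attention parameter).
    """
    # Score each frame by a dot product with the query vector.
    scores = [sum(q * x for q, x in zip(query, feat)) for feat in frame_features]
    # Softmax over time (numerically stabilized by subtracting the max).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Attention-weighted sum of frame features -> one video-level descriptor.
    dim = len(frame_features[0])
    pooled = [sum(a * feat[d] for a, feat in zip(alphas, frame_features))
              for d in range(dim)]
    return pooled, alphas
```

Frames whose features align with the query receive larger weights, so informative time steps dominate the pooled descriptor — the same intuition behind using temporal attention to capture long-term context across a video.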
“…Action recognition is a widely studied problem in computer vision, and numerous approaches have been proposed to tackle it. Traditional approaches [16,17] were mainly based on hand-crafted features, but the recent success of deep learning has led to a shift toward end-to-end learning methods [18,19]. Among them, convolutional neural networks (CNNs) have been widely adopted for their ability to effectively extract spatial and temporal features from videos.…”
Section: Action Recognition
confidence: 99%