2021
DOI: 10.1134/s105466182103024x

Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks

Cited by 5 publications (4 citation statements)
References 26 publications

“…In the video, the 2D skeleton information is first extracted and then converted into 3D skeleton information, which serves as the input to the action recognition model. The 2D pose estimation stage uses the RMPE algorithm, a top-down pose estimation method that improves on SPPE to address inaccurate and redundant detection boxes [21]. It consists of three steps: human bounding-box detection, human pose estimation, and non-maximum suppression (Figure 3).…”
Section: Methods
confidence: 99%
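Of the three steps quoted above, non-maximum suppression is a self-contained routine that can be shown concretely. Below is a minimal NumPy sketch of greedy NMS; the corner box format, the function name, and the IoU threshold are illustrative assumptions, not details from the cited paper:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedy non-maximum suppression.

    boxes  : (N, 4) array of [x1, y1, x2, y2] corners (assumed format)
    scores : (N,) detection confidences
    Returns indices of the boxes to keep.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too strongly
        order = order[1:][iou <= iou_thresh]
    return keep
```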
“…In contrast to 2D CNNs, the convolutional filters of a 3D CNN are three-dimensional; the third filter dimension spans the time depth [44,45]. This preserves temporal structure across layers, enabling a 3D CNN to extract spatio-temporal coupling relationships from the input data [46].…”
Section: The Delay Prediction Model
confidence: 99%
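To make the "time depth" of a 3D convolutional filter concrete, here is a minimal PyTorch sketch; the channel counts, clip size, and kernel size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A 3D convolution slides over (time, height, width) simultaneously,
# so each filter couples motion across frames with spatial appearance.
conv3d = nn.Conv3d(
    in_channels=3,           # RGB input
    out_channels=16,         # illustrative filter count (assumption)
    kernel_size=(3, 3, 3),   # first dim = time depth, then H and W
    padding=(1, 1, 1),
)

# Input layout: (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 16, 112, 112)  # a 16-frame 112x112 clip
out = conv3d(clip)
print(out.shape)  # torch.Size([1, 16, 16, 112, 112]) -- time dim preserved
```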
“…Zebhi et al. [23] used the Gait History Image (GHI) and gradients to extract spatio-temporal features and the time-sliced averaged gradient boundary to characterise motion, with a VGG-16 classifier. Wang et al. [24] extracted spatial features using gradients and motion features using optical flow, which were then classified by two separate 3D CNNs. Khan et al. [25] used DenseNet201 and InceptionV3 to extract features and a kurtosis-controlled weighted KNN as the final classifier.…”
Section: State of the Art
confidence: 99%
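As a rough illustration of the two-stream design attributed to Wang et al. [24] above, a minimal PyTorch sketch follows. The tiny branch architecture, the score-averaging fusion, and the precomputed gradient/optical-flow inputs are assumptions for illustration, not details from that paper:

```python
import torch
import torch.nn as nn

class Stream3D(nn.Module):
    """A tiny 3D CNN branch; real models are much deeper (assumption)."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

num_classes = 10  # illustrative
spatial_net = Stream3D(in_channels=1, num_classes=num_classes)  # gradient magnitude
motion_net = Stream3D(in_channels=2, num_classes=num_classes)   # optical flow (u, v)

grad_clip = torch.randn(1, 1, 16, 112, 112)  # precomputed gradient volume
flow_clip = torch.randn(1, 2, 16, 112, 112)  # precomputed flow volume

# Late fusion by averaging class scores; one option among several.
scores = (spatial_net(grad_clip) + motion_net(flow_clip)) / 2
print(scores.shape)  # torch.Size([1, 10])
```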