2019
DOI: 10.1049/iet-cvi.2018.5014

Learning to recognise 3D human action from a new skeleton‐based representation using deep convolutional neural networks

Abstract: Recognizing human actions in untrimmed videos is an important and challenging task. An effective 3D motion representation and a powerful learning model are two key factors influencing recognition performance. In this paper, we introduce a new skeleton-based representation for 3D action recognition in videos. The key idea of the proposed representation is to transform 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a color encoding process. By normalizing the 3D joint coordina…
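The abstract describes mapping normalized 3D joint coordinates to RGB pixel values. A minimal sketch of such a color-encoding step is given below; the function name, the per-axis min-max normalization, and the joints-by-frames image layout are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def skeleton_to_rgb_image(joints):
    """Hypothetical sketch: encode a skeleton sequence as an RGB image.

    joints: array of shape (T, J, 3) holding the (x, y, z) coordinates of
    J body joints over T frames. Each coordinate axis is normalized to
    [0, 255] independently and mapped to one color channel, yielding a
    J x T image whose rows are joints and columns are time steps.
    """
    joints = np.asarray(joints, dtype=np.float64)
    lo = joints.min(axis=(0, 1), keepdims=True)   # per-axis minimum
    hi = joints.max(axis=(0, 1), keepdims=True)   # per-axis maximum
    span = np.where(hi - lo > 0, hi - lo, 1.0)    # avoid divide-by-zero
    norm = (joints - lo) / span                   # values in [0, 1]
    img = np.round(255.0 * norm).astype(np.uint8)
    # (T, J, 3) -> (J, T, 3): rows = joints, columns = frames
    return img.transpose(1, 0, 2)

# Toy sequence: 4 frames, 5 joints
rng = np.random.default_rng(0)
seq = rng.normal(size=(4, 5, 3))
image = skeleton_to_rgb_image(seq)
print(image.shape)   # (5, 4, 3)
```

The resulting image is an ordinary 3-channel tensor, so any off-the-shelf CNN can consume it.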

Cited by 34 publications (17 citation statements)
References 83 publications
“…These results demonstrate the effectiveness of the proposed representation and deep learning framework since they surpass previous state-of-the-art techniques such as Lie Group Representation [23], Hierarchical RNN [37], Dynamic Skeletons [93], Two-Layer P-LSTM [39], ST-LSTM Trust Gates [40], Geometric Features [74], Two-Stream RNN [91], Enhanced Skeleton [94], Lie Group Skeleton+CNN [95], and GCA-LSTM [92]. The experimental results have also shown that the proposed method leads to better overall action recognition performance than our previous models, including Skeleton-based ResNet [51] and SPMF Inception-ResNet-222 [48]. With a high recognition rate on the Cross-View evaluation (86.82%), where the sequences provided by cameras 2 and 3 are used for training and sequences from camera 1 are used for testing, the proposed method shows its effectiveness in dealing with the view-independent action recognition problem.…”
Section: Experimental Results and Analysis (mentioning; confidence: 99%)
“…This makes the training and inference processes much simpler and faster. Third, as shown in our previous works [48,51], the spatio-temporal dynamics of skeleton sequences can be transformed into color images, a kind of 3D tensor-structured representation that can be effectively learned by representation learning models such as D-CNNs. Fourth, many different action classes share a great number of similar primitives, which interferes with action classification.…”
Section: Introduction (mentioning; confidence: 99%)
“…In the same regard, Ke et al [27] propose to transform a skeleton sequence into three video clips; the CNN features of the three clips are then merged into a single feature vector, which is finally passed to a softmax function for classification. Pham et al [28] propose to use a residual network [29] with the transformed, normalized skeleton in RGB space as the input. Cao et al [30] propose to classify the resulting image using gated convolutions.…”
Section: Convolutional Neural Network (CNN) (mentioning; confidence: 99%)
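The quoted passage describes the common pipeline in these works: once the skeleton sequence is encoded as an image, a CNN extracts features that are pooled and classified. A minimal numpy sketch of that forward pass follows; the kernel count, class count, and random weights are all illustrative assumptions, not any cited network's architecture.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2D cross-correlation, summed over color channels."""
    H, W, _ = img.shape
    kh, kw, _ = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw, :] * kernel)
    return out

def tiny_cnn_logits(img, kernels, weights):
    """One conv layer + ReLU + global average pooling + linear classifier."""
    feats = np.array([conv2d_valid(img, k).clip(min=0).mean() for k in kernels])
    return weights @ feats   # one logit per action class

rng = np.random.default_rng(1)
img = rng.random((20, 25, 3))             # stand-in for an encoded skeleton image
kernels = rng.standard_normal((8, 3, 3, 3))
weights = rng.standard_normal((10, 8))    # 10 hypothetical action classes
logits = tiny_cnn_logits(img, kernels, weights)
pred = int(np.argmax(logits))             # predicted class index
```

In practice the cited works use deep pretrained networks (e.g. ResNet variants) rather than a single random conv layer; the sketch only shows the data flow from encoded image to class scores.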
“…We then classify these images using standard computer vision deep-learning methods, such as in [27][28][29][30][31], while preserving spatial and temporal relationships.…”
Section: Cartesian Coordinates Features Branch (mentioning; confidence: 99%)
“…One of the major challenges in exploiting D-CNNs for skeleton-based action recognition is how a skeleton sequence could be effectively represented and fed to the deep networks. As D-CNNs work well on still images [18], our idea therefore is to encode the spatial and temporal dynamics of skeletons into 2D images [28,29]. Two essential elements for describing an action are static poses and their temporal dynamics.…”
Section: Enhanced Skeleton Pose-Motion Feature (mentioning; confidence: 99%)
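The last quote names the two elements to encode: static poses and their temporal dynamics. One simple way to capture both, sketched below under assumed conventions (the function name and the use of frame-to-frame differences as the motion cue are illustrative, not the paper's exact pose-motion feature):

```python
import numpy as np

def pose_motion_images(joints):
    """Hypothetical sketch of a pose-motion encoding.

    The pose image encodes normalized joint positions per frame; the
    motion image encodes frame-to-frame joint displacements, covering
    the two elements above: static poses and their temporal dynamics.
    """
    def to_uint8(x):
        lo, hi = x.min(), x.max()
        span = hi - lo if hi > lo else 1.0
        return np.round(255.0 * (x - lo) / span).astype(np.uint8)

    joints = np.asarray(joints, dtype=np.float64)            # (T, J, 3)
    pose = to_uint8(joints).transpose(1, 0, 2)               # (J, T, 3)
    motion = to_uint8(np.diff(joints, axis=0)).transpose(1, 0, 2)  # (J, T-1, 3)
    return pose, motion

seq = np.random.default_rng(2).normal(size=(6, 5, 3))  # 6 frames, 5 joints
pose_img, motion_img = pose_motion_images(seq)
print(pose_img.shape, motion_img.shape)   # (5, 6, 3) (5, 5, 3)
```

The two images can then be fed to a D-CNN as separate input streams or stacked along the channel axis.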