Ensemble One-Dimensional Convolution Neural Networks for Skeleton-Based Action Recognition

Xu, Yangyang; Cheng, Jun; Wang, Lei; Xia, Haiying; Liu, Feng; Tao, Dapeng

doi:10.1109/lsp.2018.2841649

Cited by 77 publications

(27 citation statements)

References 28 publications

“…The CNN-based methods transform 3D-skeleton sequence data from a vector sequence to a pseudo-image and use the CNN network to extract the pseudo-image features [18, 19, 20, 21, 22, 23]. The advantage of the CNN method lies in its powerful feature extraction capabilities, but it cannot handle the spatial and temporal relationships in skeleton information well.…”

Section: Related Work (mentioning)

confidence: 99%
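
The pseudo-image transform mentioned in the statement above can be illustrated with a minimal sketch. It assumes a skeleton sequence stored as a `(frames, joints, 3)` array, per-channel min-max scaling to 0-255, and a joints-by-frames layout; the function name `skeleton_to_pseudo_image` and these choices are illustrative assumptions, not the encoding used by any of the cited works.

```python
import numpy as np

def skeleton_to_pseudo_image(seq):
    """Map a (T, J, 3) skeleton sequence to a (J, T, 3) uint8 pseudo-image.

    Assumed layout (hypothetical): rows are joints, columns are frames,
    and the x/y/z coordinates play the role of the three colour channels.
    """
    seq = np.asarray(seq, dtype=np.float64)
    # Per-channel minimum and maximum over all frames and joints.
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    # Guard against a constant channel to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = (seq - lo) / span
    img = np.round(scaled * 255).astype(np.uint8)
    # Reorder (T, J, 3) to (J, T, 3).
    return img.transpose(1, 0, 2)
```

A 2D CNN can then consume the result like any RGB image, although the row order of the joints is arbitrary, which is one reason the citing papers argue that such encodings lose the skeleton's spatial structure.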

“…With the continuous development of deep learning, many methods extract motion patterns to form a skeleton sequence and use RNNs to model them [13, 14, 15, 16, 17]. Other methods based on CNNs transform skeleton data into pseudo images and send them into CNNs for prediction [18, 19, 20, 21, 22, 23]. However, these methods cannot capture the inherent spatial relationship between joints.…”

Section: Introduction (mentioning)

confidence: 99%

Whole and Part Adaptive Fusion Graph Convolutional Networks for Skeleton-Based Action Recognition

Zuo, Zou, Fan, et al. 2020, Sensors

Spatiotemporal graph convolution has made significant progress in skeleton-based action recognition in recent years. Most of the existing graph convolution methods take all the joints of the human skeleton as the overall modeling graph, ignoring the differences in the movement patterns of various parts of the human, and cannot well connect the relationship between the different parts of the human skeleton. To capture the unique features of different parts of human skeleton data and the correlation of different parts, we propose two new graph convolution methods: the whole graph convolution network (WGCN) and the part graph convolution network (PGCN). WGCN learns the whole scale skeleton spatiotemporal features according to the movement patterns and physical structure of the human skeleton. PGCN divides the human skeleton graph into several subgraphs to learn the part scale spatiotemporal features. Moreover, we propose an adaptive fusion module that combines the two features for multiple complementary adaptive fusion to obtain more effective skeleton features. By coupling these proposals, we build a whole and part adaptive fusion graph convolution neural network (WPGCN) that outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.

“…Based on LSTM, Song et al. [14] proposed a spatio-temporal attention model, which can automatically focus on the discriminative joints and pay different attention weights to each frame. CNN based methods: Some previous works have employed CNN for skeleton based action recognition and achieved great success [15][16][17][18][19][20]. Ke et al. [19] represented the sequence as three clips for each channel of the 3D coordinates, which reflects the temporal information of the skeleton sequence and spatial relationship.…”

Section: Related Work (mentioning)

confidence: 99%

A Multi-Feature Representation of Skeleton Sequences for Human Interaction Recognition

Wang, Deng. 2020, Electronics

Inspired from the promising performances achieved by recurrent neural networks (RNN) and convolutional neural networks (CNN) in action recognition based on skeleton, this paper presents a deep network structure which combines both CNN for classification and RNN to achieve attention mechanism for human interaction recognition. Specifically, the attention module in this structure is utilized to give various levels of attention to various frames by different weights, and the CNN is employed to extract the high-level spatial and temporal information of skeleton data. These two modules seamlessly form a single network architecture. In addition, to eliminate the impact of different locations and orientations, a coordinate transformation is conducted from the original coordinate system to the human-centric coordinate system. Furthermore, three different features are extracted from the skeleton data as the inputs of three subnetworks, respectively. Eventually, these subnetworks fed with different features are fused as an integrated network. The experimental result shows the validity of the proposed approach on two widely used human interaction datasets.

“…Pushpajit et al. [12] separately trained and combined 5‐CNN streams of different modalities to improve the recognition accuracy. Xu et al. [14] proposed an Ensemble Neural Network (Ensem‐NN) to combine four different subnets based on designed one‐dimensional ConvNets with residual structure for skeleton‐based human action recognition. Ren et al. [13] used a rank pooling mechanism to construct dynamic images and applied segment training strategy to obtain multiple ConvNets, and achieved better recognition performance by fusing learned features of different modalities.…”

Section: Related Work (mentioning)

confidence: 99%
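
The idea of combining several one-dimensional ConvNet subnets with residual structure can be sketched roughly as follows. This is a toy NumPy illustration under stated assumptions: a single residual block per subnet, global average pooling over time, a linear classifier, and fusion by averaging softmax scores. All function names, the parameter layout, and the fusion rule are hypothetical; the statement above does not specify how Ensem-NN builds or fuses its four subnets.

```python
import numpy as np

def conv1d_same(x, w, b):
    """Temporal 1D convolution with zero padding.

    x: (C_in, T), w: (C_out, C_in, K) with odd K, b: (C_out,).
    Returns an array of shape (C_out, T).
    """
    c_out, _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    out = np.empty((c_out, t))
    for i in range(t):
        window = xp[:, i:i + k]  # (C_in, K) slice centred on frame i
        out[:, i] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return out

def residual_block(x, w, b):
    # Identity shortcut plus ReLU(conv); requires C_out == C_in.
    return x + np.maximum(conv1d_same(x, w, b), 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def subnet_scores(x, params):
    """One hypothetical subnet: residual block, temporal pooling, classifier."""
    w, b, w_cls = params
    h = residual_block(x, w, b)
    pooled = h.mean(axis=1)  # global average pooling over time, shape (C,)
    return softmax(w_cls @ pooled)

def ensemble_predict(x, all_params):
    # Assumed fusion rule: average the class probabilities of all subnets.
    probs = np.mean([subnet_scores(x, p) for p in all_params], axis=0)
    return probs, int(np.argmax(probs))
```

Here every subnet sees the same input for simplicity; an actual ensemble would typically give each subnet a different architecture or input representation so that their errors are less correlated.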

Joint learning of convolution neural networks for RGB‐D‐based human action recognition

et al. 2020

Self Cite

RGB‐D‐based human action recognition aims to learn distinctive features from different modalities and has shown good progress in practice. However, it is difficult to improve the recognition performance through directly training multiple individual convolutional networks (ConvNets) and fusing features later because complementary information between different modalities cannot be learned. To address this issue, this Letter proposes a single two‐stream ConvNets framework for multimodality learning that extracts features through RGB and depth streams. Specifically, the authors first represent RGB‐D sequence to motion images as the inputs of the proposed ConvNets for obtaining spatial–temporal information. Then, a features fusion and joint training strategy is adapted to learn RGB‐D complementary features simultaneously. Experimental results on benchmark NTU RGB+D 120 dataset validate the effectiveness of the proposed framework and demonstrate that two‐stream ConvNets outperforms the current state‐of‐the‐art approaches.
