Skeleton-based action recognition via spatial and temporal transformer networks

Plizzari, Chiara; Cannici, Marco; Matteucci, Matteo

doi:10.1016/j.cviu.2021.103219

Cited by 217 publications

(139 citation statements)

References 47 publications

Supporting

Mentioning

139

Contrasting

Order By: Relevance

“…Data-efficient ADL recognition. Despite the impressive improvements in general activity recognition [27], [28],…”

Section: Related Workmentioning

confidence: 99%

ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

Peng¹,

Roitberg²,

Yang³

et al. 2022

Preprint

View full text Add to dashboard Cite

Automatically understanding human behaviour allows household robots to identify the most critical needs and plan how to assist the human according to the current situation. However, the majority of such methods are developed under the assumption that a large amount of labelled training examples is available for all concepts-of-interest. Robots, on the other hand, operate in constantly changing unstructured environments, and need to adapt to novel action categories from very few samples. Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays and then used as input to convolutional neural networks. We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement. In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning. Inspired by recent success of feature enhancement methods in semi-supervised learning, we further introduce PROFORMER -an improved training strategy which uses soft-attention applied on iteratively estimated action category PROtotypes used to augment the embeddings and compute an auxiliary consistency loss. Extensive experiments consistently demonstrate the effectiveness of our approach for one-shot recognition from body poses, achieving state-of-the-art results on multiple datasets and surpassing the best published approach on the challenging NTU-120 one-shot benchmark by 1.84%. Our code will be made publicly available at https: //github.com/KPeng9510/ProFormer.

show abstract

“…Data-efficient ADL recognition. Despite the impressive improvements in general activity recognition [27], [28],…”

Section: Related Workmentioning

confidence: 99%

ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

Peng¹,

Roitberg²,

Yang³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…For more than one person in the clip, each person is scored individually and we take the highest score of each person in the frame. As done in work [35], the number of heads of multi-head attention is set to 8, and the embedding dimensions of d q , d k , and d v in each layer are 0.25 × C out in all these experiments.…”

Section: Implementation Detailsmentioning

confidence: 99%

A Self-Attention Augmented Graph Convolutional Clustering Networks for Skeleton-Based Video Anomaly Behavior Detection

Liu

et al. 2021

Applied Sciences

View full text Add to dashboard Cite

In this paper, we propose a new method for detecting abnormal human behavior based on skeleton features using self-attention augment graph convolution. The skeleton data have been proved to be robust to the complex background, illumination changes, and dynamic camera scenes and are naturally constructed as a graph in non-Euclidean space. Particularly, the establishment of spatial temporal graph convolutional networks (ST-GCN) can effectively learn the spatio-temporal relationships of Non-Euclidean Structure Data. However, it only operates on local neighborhood nodes and thereby lacks global information. We propose a novel spatial temporal self-attention augmented graph convolutional networks (SAA-Graph) by combining improved spatial graph convolution operator with a modified transformer self-attention operator to capture both local and global information of the joints. The spatial self-attention augmented module is used to understand the intra-frame relationships between human body parts. As far as we know, we are the first group to utilize self-attention for video anomaly detection tasks by enhancing spatial temporal graph convolution. Moreover, to validate the proposed model, we performed extensive experiments on two large-scale publicly standard datasets (i.e., ShanghaiTech Campus and CUHK Avenue datasets) which reveal the state-of-art performance for our proposed approach when compared to existing skeleton-based methods and graph convolution methods.

show abstract

“…Cho et al [13] first applied self-attention [37] to the skeleton-based HAR problem. More recently, Plizzari et al [29], inspired by Bello et al [7], employed self-attention to overcome the locality of the convolutions, again adopting a two-stream ensemble method, where self-attention is applied on the temporal and spatial information, respectively.…”

Section: Related Workmentioning

confidence: 99%

“…Among those, MLSTM-FCN [21] combines convolutions, spacial attention, and an LSTM block, and its improved version ActionXPose [2] uses additional preprocessing, leading the model to exploit more correlations in data. MS-G3D [26] makes use of spatial-temporal graph convolutions to make the model aware of spatial relations between skeleton keypoints, while ST-TR [29] joins graph convolutions with Transformer-based self-attention applied both to space and time. As the last two solutions also propose a model ensemble solution, these results are further compared to AcT ensembles made of 2, 5, and 10 singleshot models.…”

Section: B Action Recognition On Mpose2021mentioning

confidence: 99%

Action Transformer: A Self-Attention Model for Short-Time Pose-Based Human Action Recognition

Mazzia,

Angarano,

Salvetti

et al. 2021

Preprint

View full text Add to dashboard Cite

Deep neural networks based purely on attention have been successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have been primarily adopted on top of standard convolutional or recurrent layers, improving the overall generalization capability. In this work, we introduce Action Transformer (AcT), a simple, fully selfattentional architecture that consistently outperforms more elaborated networks that mix convolutional, recurrent, and attentive layers. In order to limit computational and energy requests, building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an attempt to build a formal training and evaluation benchmark for real-time short-time human action recognition. Extensive experimentation on MPOSE2021 with our proposed methodology and several previous architectural solutions proves the effectiveness of the AcT model and poses the base for future work on HAR.

show abstract

Skeleton-based action recognition via spatial and temporal transformer networks

Cited by 217 publications

References 47 publications

ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

A Self-Attention Augmented Graph Convolutional Clustering Networks for Skeleton-Based Video Anomaly Behavior Detection

Action Transformer: A Self-Attention Model for Short-Time Pose-Based Human Action Recognition

Contact Info

Product

Resources

About