Towards To-a-T Spatio-Temporal Focus for Skeleton-Based Action Recognition

Ke, Lipeng; Peng, Kuan-Chuan; Lyu, Siwei

doi:10.1609/aaai.v36i1.19998

Cited by 33 publications

(7 citation statements)

References 37 publications

(85 reference statements)

“…Moreover, compared to GCN-based methods, the performance of our FG-STFormer is also at the top. It compares favourably with current state-of-the-art STF [17] and CTR-GCN [5] on NTU-60 and NTU-120, and even outperforms the latter on NW-UCLA by 0.5%, verifying the effectiveness of FG-STFormer.…”

Section: Comparison With the State-of-the-arts

mentioning

confidence: 62%

Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition

Gao, Wang, Lv et al. 2022, Preprint

Despite the great progress achieved by transformers in various vision tasks, they remain underexplored for skeleton-based action recognition, with only a few attempts. Besides, these methods directly calculate the pair-wise global self-attention equally for all the joints in both the spatial and temporal dimensions, undervaluing the effect of discriminative local joints and the short-range temporal dynamics. In this work, we propose a novel Focal and Global Spatial-Temporal Transformer network (FG-STFormer) that is equipped with two key components: (1) FG-SFormer: focal joints and global parts coupling spatial transformer. It forces the network to focus on modelling correlations for both the learned discriminative spatial joints and human body parts respectively. The selective focal joints eliminate the negative effect of non-informative ones when accumulating the correlations. Meanwhile, the interactions between the focal joints and body parts are incorporated to enhance the spatial dependencies via mutual cross-attention. (2) FG-TFormer: focal and global temporal transformer. Dilated temporal convolution is integrated into the global self-attention mechanism to explicitly capture the local temporal motion patterns of joints or body parts, which is found to be vitally important for making the temporal transformer work. Extensive experimental results on three benchmarks, namely NTU-60, NTU-120 and NW-UCLA, show our FG-STFormer surpasses all existing transformer-based methods, and compares favourably with state-of-the-art GCN-based methods.
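The abstract above names dilated temporal convolution as the mechanism FG-TFormer uses to capture local motion patterns. As a rough illustration only, and not the authors' implementation, the sketch below applies a 1D dilated filter to a single joint coordinate over time. The function name `dilated_conv1d`, the zero padding, and the toy trajectory are all assumptions made for this example.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Apply a 1D dilated filter (cross-correlation form) with zero padding.

    Hypothetical helper for illustration; the output has the same length
    as the input.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd so it has a centre tap")
    centre = len(kernel) // 2
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            # Taps are spaced `dilation` frames apart around frame t.
            index = t + (k - centre) * dilation
            if 0 <= index < len(signal):  # frames outside the clip count as zero
                total += weight * signal[index]
        output.append(total)
    return output


# Toy trajectory: one joint coordinate sampled over five frames (assumed data).
trajectory = [1.0, 2.0, 3.0, 4.0, 5.0]
difference_filter = [1.0, 0.0, -1.0]

short_range = dilated_conv1d(trajectory, difference_filter, dilation=1)
long_range = dilated_conv1d(trajectory, difference_filter, dilation=2)
```

A filter with k taps and dilation d spans (k - 1) * d + 1 frames, so the same three weights cover 3 frames at dilation 1 and 5 frames at dilation 2. This is why dilation can widen the temporal receptive field without adding parameters.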

“…ST-GCN obtains impressive recognition improvement, however, the predefined spatial graph makes it difficult to capture abstract relationships between distant joints. To overcome this limitation, many methods [14,22,23] are proposed with various kinds of learnable graph modules or attention mechanisms. Shi et al [13] propose the Multi-stream Attention-enhanced Adaptive Graph Convolutional Network (MS-AAGCN).…”

Section: Skeleton-based Action Recognition

mentioning

confidence: 99%

“…GCN [6] is the first GCN-based recognition model. After that, more kinds of improvements [21][22][23] are proposed with various adaptive graph modules or attention mechanisms.…”

mentioning

confidence: 99%

Overcomplete graph convolutional denoising autoencoder for noisy skeleton action recognition

Guo, Ji, Shan 2023, IET Image Processing

Current skeleton-based action recognition methods usually assume the input skeleton is complete and noise-free. However, it is inevitable that the captured skeletons are incomplete due to occlusions or noisy due to changes in the environment. When dealing with these data, even state-of-the-art (SOTA) recognition backbones experience significant degradation in recognition accuracy. Though a few methods have been proposed to address this issue, they still lack flexibility, efficiency and interpretability. In this work, an overcomplete Graph Convolutional Denoising Autoencoder (GCDAE) is proposed which can act as a flexible preprocessing module for pretrained recognition backbones and improve their robustness. Taking advantage of the overcomplete and fully graph convolutional structure, GCDAE is able to rectify noisy joints while efficiently keeping the information in unspoiled details. On two large-scale skeleton datasets, NTU RGB+D 60 and 120, introducing GCDAE brings significant robustness improvements to SOTA backbones against different types of noise.

“…InfoGCN [8] develops an information rate-based framework for learning objectives. STF [10] provides a flexible framework for learning spatio-temporal gradients for skeleton-based action recognition. However, many advanced methods spend a great deal of effort on spatial feature extraction, ignoring the extraction of temporal features, and simply use the same multi-scale feature extraction module in each layer of the network.…”

Section: Related Work

mentioning

confidence: 99%

“…GCN modules close to the network output tend to have larger receptive fields and can capture more contextual information. In addition, the receptive field of the GCN module close to the network input is relatively small [10]. It can be concluded that different layers have different effects on skeleton recognition.…”

Section: Introduction

mentioning

confidence: 99%

Multi-Modality Adaptive Feature Fusion Graph Convolutional Network for Skeleton-Based Action Recognition

Zhang et al. 2023, Sensors

Graph convolutional networks are widely used in skeleton-based action recognition because of their good fitting ability to non-Euclidean data. While conventional multi-scale temporal convolution uses several fixed-size convolution kernels or dilation rates at each layer of the network, we argue that different layers and datasets require different receptive fields. We use multi-scale adaptive convolution kernels and dilation rates to optimize traditional multi-scale temporal convolution with a simple and effective self attention mechanism, allowing different network layers to adaptively select convolution kernels of different sizes and dilation rates instead of being fixed and unchanged. Besides, the effective receptive field of the simple residual connection is not large, and there is a great deal of redundancy in the deep residual network, which will lead to the loss of context when aggregating spatio-temporal information. This article introduces a feature fusion mechanism that replaces the residual connection between initial features and temporal module outputs, effectively solving the problems of context aggregation and initial feature fusion. We propose a multi-modality adaptive feature fusion framework (MMAFF) to simultaneously increase the receptive field in both spatial and temporal dimensions. Concretely, we input the features extracted by the spatial module into the adaptive temporal fusion module to simultaneously extract multi-scale skeleton features in both spatial and temporal parts. In addition, based on the current multi-stream approach, we use the limb stream to uniformly process correlated data from multiple modalities. Extensive experiments show that our model obtains competitive results with state-of-the-art methods on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets.
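The abstract above describes letting each layer adaptively weight temporal branches with different kernel sizes and dilation rates through an attention mechanism. One common way to realise that idea is a softmax-weighted blend of the branch outputs. The sketch below shows only that blending step under stated assumptions: `fuse_branches`, the per-branch scores, and the toy feature sequences are hypothetical, and in a trained model the scores would be learned rather than supplied by hand. It is not the MMAFF implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branch_outputs, branch_scores):
    """Blend per-branch feature sequences using softmax attention weights.

    branch_outputs: equal-length lists, one per kernel size or dilation rate.
    branch_scores: one raw score per branch (hand-set here, learned in practice).
    """
    if len(branch_outputs) != len(branch_scores):
        raise ValueError("need exactly one score per branch")
    weights = softmax(branch_scores)
    length = len(branch_outputs[0])
    fused = [
        sum(w * branch[t] for w, branch in zip(weights, branch_outputs))
        for t in range(length)
    ]
    return fused, weights


# Toy outputs from a small-receptive-field and a large-receptive-field branch.
small_branch = [1.0, 1.0, 1.0]
large_branch = [3.0, 3.0, 3.0]

fused_equal, weights_equal = fuse_branches([small_branch, large_branch], [0.0, 0.0])
fused_skewed, weights_skewed = fuse_branches([small_branch, large_branch], [0.0, 5.0])
```

With equal scores the two branches are averaged; raising one score pushes the fused output towards that branch. Because the weights are continuous, a layer can favour one receptive field without discarding the others entirely.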
