2021 | Preprint
DOI: 10.48550/arxiv.2109.02860
Hierarchical Graph Convolutional Skeleton Transformer for Action Recognition

Abstract: Graph convolutional networks (GCNs) achieve promising performance for skeleton-based action recognition. However, restricted by the neighborhood of feature aggregation, the spatial-temporal graph convolution in most GCNs lacks the flexibility of feature extraction. In this paper, we present a novel architecture, named Graph Convolutional skeleton Transformer (GCsT), which can flexibly capture local-global contexts. In GCsT, Transformer and GCNs play complementary roles for feature representations. Specifically, hi…
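The abstract describes a design in which graph convolution aggregates features over each joint's skeleton neighborhood while self-attention captures global context across all joints. Below is a minimal PyTorch-style sketch of one such hybrid block; the module names, shapes, and the residual composition of the two branches are illustrative assumptions, not the GCsT implementation.

```python
# Illustrative sketch only: a GCN layer followed by self-attention over joints.
# Shapes, names, and composition are assumptions, not the GCsT architecture.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Aggregate each joint's features over its skeleton neighborhood."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)      # (V, V) normalized adjacency
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                          # x: (batch, V joints, in_dim)
        return torch.relu(self.proj(self.A @ x))   # local aggregation, then projection

class HybridBlock(nn.Module):
    """GCN for local structure, self-attention for global joint-to-joint context."""
    def __init__(self, dim, adjacency, heads=4):
        super().__init__()
        self.gcn = GraphConv(dim, dim, adjacency)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (batch, V, dim)
        x = self.gcn(x)                            # local neighborhood features
        ctx, _ = self.attn(x, x, x)                # global context, not limited by topology
        return self.norm(x + ctx)                  # residual fusion of both views

# Toy usage: 25 joints (e.g., an NTU-style skeleton), 64-dim features.
V, dim = 25, 64
A = torch.eye(V)                                   # placeholder adjacency
block = HybridBlock(dim, A)
out = block(torch.randn(2, V, dim))                # -> (2, 25, 64)
```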

Cited by 8 publications (21 citation statements) | References 19 publications
“…Following this trend, a large number of models have been proposed, designed to achieve high accuracy [44] or resource efficiency [45], [46], or tailored for specific tasks such as object detection [47] or semantic segmentation [22]. While multiple works leverage visual transformer models in conventional video-based human activity classification [19], [48], [49], or standard sequence transformers for skeleton encodings [28], [50], [51], the potential of visual transformers as data-efficient encoders of body movement cast as images has not yet been considered, and this is the main motivation of our work. Furthermore, inspired by the recent success of feature augmentation methods in semi-supervised learning [24], we for the first time propose training a visual transformer with an additional auxiliary branch for augmenting the embeddings using category-specific prototypes and self-attention.…”
Section: Related Work
confidence: 99%
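The last sentence of the statement above describes an auxiliary branch that mixes an input embedding with learnable category-specific prototypes via self-attention. One possible reading of that idea is sketched below; the prototype count, attention setup, and residual fusion are assumptions, not the cited paper's method.

```python
# Hedged sketch of prototype-based embedding augmentation; details are assumed.
import torch
import torch.nn as nn

class PrototypeAugment(nn.Module):
    def __init__(self, dim, num_classes, heads=4):
        super().__init__()
        # One learnable prototype vector per category (an assumption).
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, emb):                        # emb: (batch, dim)
        q = emb.unsqueeze(1)                       # (batch, 1, dim) query token
        kv = self.prototypes.unsqueeze(0).expand(emb.size(0), -1, -1)
        mixed, _ = self.attn(q, kv, kv)            # attend over class prototypes
        return emb + mixed.squeeze(1)              # residual augmentation

aug = PrototypeAugment(dim=128, num_classes=60)
z = aug(torch.randn(8, 128))                       # -> (8, 128)
```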
“…Secondly, the transformer structure also benefits a large number of downstream tasks, e.g., semantic segmentation [33], remote sensing image classification [34,35,36] and behavior analysis [37,38,39]. However, in tasks such as semantic segmentation and remote sensing image classification, the contribution of the transformer structure is still limited to its advantage in visual feature extraction.…”
Section: Related Work
confidence: 99%
“…On the contrary, in behavior analysis, due to the similarity between video frames and image patches (both are parts of the whole video stream or image), the transformer structure is introduced to exploit the temporal information among these video frames [38]. Similarly, a transformer is also employed to exploit the features from the pose skeleton in order to recognize human actions [39].…”
Section: Related Work
confidence: 99%
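The statement above treats frames the way image transformers treat patches: a sequence of tokens that attend to one another over time. A minimal sketch of temporal self-attention over per-frame feature vectors follows; the shapes, layer counts, and mean-pooled clip representation are all assumptions.

```python
# Minimal sketch: self-attention over per-frame features (assumed setup).
import torch
import torch.nn as nn

frames = torch.randn(2, 32, 256)                   # (batch, T frames, feature dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
temporal = encoder(frames)                         # each frame attends to all others
clip_repr = temporal.mean(dim=1)                   # (2, 256) clip-level feature
```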
“…Note that both ST-TR and KA-AGTN still extract low-level features through a standard ST-GCN. Bai's work [2] introduced a Hierarchical Graph Convolutional skeleton Transformer (HGCT) built on a disentangled spatiotemporal transformer (DSTT) block to overcome the neighborhood constraints and entangled spatiotemporal feature representations of GCNs. Unlike in ST-GCN, the adjacency matrix of the DSTT is parameterized and updated during training.…”
Section: Skeleton-based Graph Transformer Network
confidence: 99%
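The contrast drawn above, a fixed versus a trainable adjacency matrix, can be illustrated with a short sketch; the identity initialization and softmax row-normalization below are assumptions, not the DSTT specifics.

```python
# Sketch: graph conv with a trainable adjacency (vs. ST-GCN's fixed topology).
import torch
import torch.nn as nn

class LearnableAdjGraphConv(nn.Module):
    def __init__(self, dim, num_joints):
        super().__init__()
        # Adjacency is a parameter, so the joint topology is refined by training;
        # initializing from the identity (self-loops only) is an assumption.
        self.A = nn.Parameter(torch.eye(num_joints))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (batch, V, dim)
        A = torch.softmax(self.A, dim=-1)          # row-normalize learned weights
        return self.proj(A @ x)                    # gradients update A directly

layer = LearnableAdjGraphConv(dim=64, num_joints=25)
y = layer(torch.randn(2, 25, 64))                  # -> (2, 25, 64)
```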
“…𝐸 π‘π‘œπ‘  retains the order of inputs which is widely used in TFNs [13,14,27,59,65], and 𝐽 π‘π‘œπ‘  is shared by the same joint between child and therapist. The motivation for introducing 𝐽 π‘π‘œπ‘  is to adapt and modify TFNs to accommodate dyadic skeleton inputs, since current skeleton-based TFNs only accept one single skeleton as their input [2,40,47]. We suggest that 𝐽 π‘π‘œπ‘  can help TFNs better attend to the same joint across two persons, which is critical in deciding movement synchrony.…”
Section: Spatial Transformer
confidence: 99%
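The encoding scheme quoted above, a sequence-order embedding E_pos plus a joint-index embedding J_pos tied between the two people's skeletons, can be sketched as follows; the shapes, learnable-embedding form, and additive combination are assumptions.

```python
# Sketch of dyadic joint encodings: E_pos orders the tokens, J_pos is shared by
# the same joint index across both skeletons. All details are assumptions.
import torch
import torch.nn as nn

V, dim = 25, 64                                    # joints per person, feature dim
E_pos = nn.Parameter(torch.randn(2 * V, dim))      # one slot per token (two people)
J_pos = nn.Parameter(torch.randn(V, dim))          # one slot per joint index

tokens = torch.randn(8, 2 * V, dim)                # (batch, child+therapist joints, dim)
joint_ids = torch.arange(2 * V) % V                # same id for same joint, both people
encoded = tokens + E_pos + J_pos[joint_ids]        # J_pos tied across the two persons
```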