2021
DOI: 10.1016/j.cviu.2021.103219

Skeleton-based action recognition via spatial and temporal transformer networks

Cited by 217 publications (139 citation statements)
References 47 publications
“…Data-efficient ADL recognition. Despite the impressive improvements in general activity recognition [27], [28],…”
Section: Related Work
confidence: 99%
“…For clips with more than one person, each person is scored individually and the highest score among the people in the frame is taken. As done in [35], the number of heads of multi-head attention is set to 8, and the embedding dimensions d_q, d_k, and d_v in each layer are 0.25 × C_out in all these experiments.…”
Section: Implementation Details
confidence: 99%
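The attention configuration quoted above (8 heads, with d_q = d_k = d_v set to 0.25 × C_out per layer) can be sketched as a standard multi-head self-attention layer. The sketch below is illustrative rather than the cited authors' code: treating 0.25 × C_out as the total projection size (split across heads, rather than per head) is an assumption, and all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class QuotedAttention(nn.Module):
    """Multi-head self-attention with the quoted setup: 8 heads and
    q/k/v embedding dims of 0.25 * C_out. Assumes 0.25 * C_out is the
    total projection size, split evenly across heads (an assumption)."""

    def __init__(self, c_in: int, c_out: int, num_heads: int = 8):
        super().__init__()
        d_head = max(1, int(0.25 * c_out) // num_heads)  # per-head dim
        d_qkv = d_head * num_heads                        # ~0.25 * C_out total
        self.num_heads = num_heads
        self.scale = d_head ** -0.5
        self.to_q = nn.Linear(c_in, d_qkv)
        self.to_k = nn.Linear(c_in, d_qkv)
        self.to_v = nn.Linear(c_in, d_qkv)
        self.proj = nn.Linear(d_qkv, c_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, c_in); tokens may be joints or frames.
        b, n, _ = x.shape
        h = self.num_heads
        q = self.to_q(x).view(b, n, h, -1).transpose(1, 2)   # (b, h, n, d_head)
        k = self.to_k(x).view(b, n, h, -1).transpose(1, 2)
        v = self.to_v(x).view(b, n, h, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)   # (b, n, d_qkv)
        return self.proj(out)                                 # (b, n, c_out)
```

For the multi-person scoring described in the same statement, one plausible reading is to run the model once per detected person and keep the maximum class score across people; the quote does not give enough detail to pin down the exact reduction.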
“…Cho et al. [13] first applied self-attention [37] to the skeleton-based HAR problem. More recently, Plizzari et al. [29], inspired by Bello et al. [7], employed self-attention to overcome the locality of convolutions, again adopting a two-stream ensemble method in which self-attention is applied to the temporal and spatial information, respectively.…”
Section: Related Work
confidence: 99%
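The spatial/temporal split described in that statement can be illustrated by applying the same attention mechanism along two different axes: over joints within each frame (spatial stream) and over frames for each joint (temporal stream). The sketch below is a rough illustration under those assumptions, not the ST-TR implementation; in particular, averaging the two stream outputs is a stand-in for the ensemble, which in the cited work combines the scores of two separately trained networks.

```python
import torch
import torch.nn as nn

def two_stream_attention(x: torch.Tensor,
                         spatial_attn: nn.MultiheadAttention,
                         temporal_attn: nn.MultiheadAttention) -> torch.Tensor:
    """Sketch of the two-stream idea: one stream attends over joints
    within each frame, the other over frames for each joint.
    x: (batch, frames, joints, channels)."""
    b, t, v, c = x.shape

    # Spatial stream: tokens = joints, batched over frames.
    xs = x.reshape(b * t, v, c)
    s_out, _ = spatial_attn(xs, xs, xs, need_weights=False)
    s_out = s_out.reshape(b, t, v, c)

    # Temporal stream: tokens = frames, batched over joints.
    xt = x.transpose(1, 2).reshape(b * v, t, c)
    t_out, _ = temporal_attn(xt, xt, xt, need_weights=False)
    t_out = t_out.reshape(b, v, t, c).transpose(1, 2)

    # Placeholder for the two-stream ensemble (the cited work instead
    # combines the scores of two separately trained networks).
    return 0.5 * (s_out + t_out)

# Example usage with hypothetical shapes (25 joints, 30 frames):
x = torch.randn(2, 30, 25, 64)
sa = nn.MultiheadAttention(64, num_heads=8, batch_first=True)
ta = nn.MultiheadAttention(64, num_heads=8, batch_first=True)
y = two_stream_attention(x, sa, ta)   # (2, 30, 25, 64)
```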
“…Among those, MLSTM-FCN [21] combines convolutions, spatial attention, and an LSTM block, and its improved version, ActionXPose [2], uses additional preprocessing that lets the model exploit more correlations in the data. MS-G3D [26] uses spatial-temporal graph convolutions to make the model aware of spatial relations between skeleton keypoints, while ST-TR [29] combines graph convolutions with Transformer-based self-attention applied to both space and time. As the last two solutions also propose a model-ensemble variant, these results are further compared to AcT ensembles made of 2, 5, and 10 single-shot models.…”
Section: B. Action Recognition on MPOSE2021
confidence: 99%
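The ensemble comparison in that last sentence can be illustrated with a simple score-averaging scheme. This is a generic sketch, not the AcT authors' code; averaging softmax probabilities is an assumption, since the quote does not state how the 2-, 5-, and 10-model ensembles combine their outputs.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ensemble_predict(models: list[nn.Module], x: torch.Tensor) -> torch.Tensor:
    """Average per-model class probabilities and return the argmax class
    per sample. `models` would be, e.g., 2, 5, or 10 single-shot networks
    trained with different seeds (an assumption); each model is expected
    to already be in eval() mode."""
    probs = torch.stack([m(x).softmax(dim=-1) for m in models])  # (M, B, C)
    return probs.mean(dim=0).argmax(dim=-1)                      # (B,)
```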