2023
DOI: 10.3390/app13042058
A Novel Two-Stream Transformer-Based Framework for Multi-Modality Human Action Recognition

Abstract: Due to the great success of Vision Transformer (ViT) in image classification tasks, many pure Transformer architectures for human action recognition have been proposed. However, very few works have attempted to use Transformer to conduct bimodal action recognition, i.e., both skeleton and RGB modalities for action recognition. As proved in many previous works, RGB modality and skeleton modality are complementary to each other in human action recognition tasks. How to use both RGB and skeleton modalities for ac…
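The bimodal fusion idea described in the abstract can be illustrated with a minimal sketch. The toy example below (NumPy only; the dimensions, random weights, stand-in encoders, and late-fusion-by-averaged-logits strategy are all illustrative assumptions, not the paper's actual two-stream Transformer architecture) shows how RGB and skeleton features might be encoded separately and then combined into one action prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Toy per-stream encoder: linear projection + ReLU (stand-in for a Transformer branch)."""
    return np.maximum(x @ w, 0.0)

def classify(feat, w_cls):
    """Shared linear classifier head producing per-class logits."""
    return feat @ w_cls

# Hypothetical dimensions: 16-d RGB features, 8-d skeleton features,
# 4-d shared embedding, 5 action classes.
rgb = rng.standard_normal((1, 16))
skel = rng.standard_normal((1, 8))
w_rgb = rng.standard_normal((16, 4))
w_skel = rng.standard_normal((8, 4))
w_cls = rng.standard_normal((4, 5))

# Late fusion: average the two streams' logits, then softmax.
logits = 0.5 * (classify(encode(rgb, w_rgb), w_cls) +
                classify(encode(skel, w_skel), w_cls))
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = int(probs.argmax())
```

Late fusion is only one of several strategies a two-stream model can use; cross-attention between the streams, as in many Transformer-based fusion designs, would instead exchange information before the classifier head.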

Cited by 11 publications (8 citation statements)
References 47 publications
“…Based on the results of this comparison with existing state-of-the-art methods, our proposed NMA-GCN can effectively improve skeleton-based human action recognition performance and has good generalization ability.

  Method            Modality        Accuracy (%)
  [20]              RGB, Flow       76.4
  TRN [69]          RGB, Flow       79.8
  TRNms [69]        RGB, Flow       80.2
  TSM [21]          RGB, Flow       81.2
  ST-GCN [11]       Skeleton        85.1
  RSANet [70]       RGB             86.4
  RGBSformer [71]   RGB, Skeleton   86.7
  MS-AAGCN [55]     Skeleton        86.7
  CTR-GCN [13]      Skeleton        88.5…”
Section: Comparison With State-of-the-art Methods
confidence: 99%
“…Research on the fusion of these two modalities suggests that the multimodal methods employed in these studies offer only slight enhancements over RGB-based HAR models [12], [14]. Furthermore, none of these studies [12], [14], [15], [16], [37], [42] assess the robustness of their proposed models under cross-dataset evaluation. Some limited research [14], [15], [16] has explored the integration of RGB and skeleton modalities using datasets containing fine-grained HOI, such as Toyota Smarthome [14], but their primary focus was not on improving the accuracy of fine-grained HOI recognition.…”
Section: Introduction
confidence: 98%
“…1) Recognition of fine-grained HOI: Most existing HAR studies [12], [13], [36], [37], [42] that fuse skeleton and RGB modalities have mainly concentrated on recognizing broad interaction categories. These studies have assessed datasets like NTU RGB+D [27], which contain only coarse-grained HOI that can be accurately classified using high-quality skeleton modality features alone.…”
Section: Introduction
confidence: 99%
“…Vision-based human action recognition is a subject of intense study, driven by the past decade's rewarding progress in artificial intelligence and computer vision. Identifying the human behavior present in each frame is a central goal, and the information gleaned is a boon to detecting dangerous or falling actions [1]. It is useful for a variety of applications, including (but not limited to) ambient assisted living [2], medical activities [3], and many more [4].…”
Section: Introduction
confidence: 99%