2022
DOI: 10.1109/access.2022.3201227

Spatial-Temporal Information Aggregation and Cross-Modality Interactive Learning for RGB-D-Based Human Action Recognition

Abstract: RGB-D-based human action recognition is gaining increasing attention because the different modalities can provide complementary information. However, recognition performance is still not satisfactory due to the limited ability to learn spatial-temporal features and insufficient inter-modal interaction. In this paper, we propose a novel approach for RGB-D human action recognition by aggregating spatial-temporal information and implementing cross-modality interactive learning. Firstly, a spatial-temporal i…
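
The abstract is truncated above, so only the high-level design is visible: per-modality spatial-temporal feature aggregation plus cross-modality interactive learning between the RGB and depth streams. The sketch below is a hypothetical PyTorch illustration of that general two-stream layout, not the authors' architecture; the module names, dimensions, temporal-average aggregation, and gating-based interaction are all assumptions made for illustration.

```python
# Hypothetical sketch (not the authors' code): a two-stream RGB-D network with
# frame-wise CNN features, simple temporal aggregation, and a gated
# cross-modality interaction before classification.
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """Per-modality 2D CNN applied frame-wise, followed by temporal average pooling."""
    def __init__(self, in_channels: int, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        frame_feats = self.cnn(clip.reshape(b * t, c, h, w)).flatten(1)  # (b*t, 64)
        frame_feats = self.proj(frame_feats).reshape(b, t, -1)           # (b, t, d)
        return frame_feats.mean(dim=1)  # temporal aggregation -> (b, d)


class CrossModalRGBDNet(nn.Module):
    """Fuses RGB and depth streams with a gated cross-modality interaction."""
    def __init__(self, feat_dim: int = 128, num_classes: int = 60):
        super().__init__()
        self.rgb_stream = StreamEncoder(in_channels=3, feat_dim=feat_dim)
        self.depth_stream = StreamEncoder(in_channels=1, feat_dim=feat_dim)
        self.gate_rgb = nn.Linear(feat_dim, feat_dim)
        self.gate_depth = nn.Linear(feat_dim, feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.rgb_stream(rgb)
        f_depth = self.depth_stream(depth)
        # Each modality is modulated by a gate computed from the other modality.
        f_rgb = f_rgb * torch.sigmoid(self.gate_depth(f_depth))
        f_depth = f_depth * torch.sigmoid(self.gate_rgb(f_rgb))
        return self.classifier(torch.cat([f_rgb, f_depth], dim=1))


if __name__ == "__main__":
    rgb = torch.randn(2, 8, 3, 112, 112)    # 2 clips, 8 frames each
    depth = torch.randn(2, 8, 1, 112, 112)
    logits = CrossModalRGBDNet()(rgb, depth)
    print(logits.shape)  # torch.Size([2, 60])
```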

Cited by 7 publications (7 citation statements)
References 78 publications
“…Qin Cheng et al. [32] considered the RGB and depth data modalities and proposed a spatio-temporal information aggregation module (SITAM), which utilizes CNNs to acquire spatio-temporal information from the input data.…”
Section: Hybrid Methods
confidence: 99%
“…Moreover, they did not provide local motion attributes to the HAR system, which results in more misclassifications for actions with similar movements. Even though Q. Cheng et al. [32] adopted a temporal attention model along with cross-modal learning, they did not reveal the inherent characteristics of actions with different speeds and durations. Further, we can see that HAR methods based on a single data modality achieve much lower recognition performance.…”
Section: Comparison
confidence: 99%
“…Additionally, the compatibility of graph convolutional methods with skeleton-based action recognition led to widespread use in the literature (Chi et al. 2022; Song et al. 2022; Cheng et al. 2020). RGB-based action recognition is sparser in the literature due to its lack of 3D structure, usually only being used in addition to other modalities (Wang et al. 2019; Cheng et al. 2022; Das et al. 2020). Contrary to these previous works, we demonstrate how our RGB-based model is able to achieve state-of-the-art multi-view action recognition performance, even over skeleton-based models.…”
Section: Multi-view Action Recognition
confidence: 99%
“…Learning view-invariant representations for multi-view action recognition has also been explored before for both uni- and multi-modal approaches (Li et al. 2018b; Das and Ryoo 2023; Ji et al. 2021; Bian et al. 2023), and by using either solely convolutional or transformer-based architectures (Cheng et al. 2022; Vyas, Rawat, and Shah 2020; Ji et al. 2021). Our approach uses a hybrid architecture to limit each of these respective shortcomings, while also introducing a novel configuration of queries for transformer decoders that facilitates feature disentanglement.…”
Section: Multi-view Action Recognition
confidence: 99%
“…New fusion approaches are proposed in a group of studies (Cheng et al. 2021; Zhou et al. 2021; Tian et al. 2020; Wang et al. 2019a; Hampiholi et al. 2023; Lee et al. 2023; Cheng et al. 2022). In Cheng et al. (2021), a cross-modality compensation block (CMCB) is developed to learn cross-modality complementary features from the RGB and depth modalities.…”
Section: RGB and Depth
confidence: 99%
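
The statement above only says that the CMCB learns complementary cross-modality features from RGB and depth; its concrete layout is not described here. The following is a hypothetical sketch of one plausible compensation mechanism (a residual feature exchange between the two streams), not the CMCB implementation from Cheng et al. (2021); the class name and 1x1-convolution design are assumptions.

```python
# Hypothetical sketch only: features from one modality are projected and added
# to the other modality's feature map as a residual "compensation" term.
import torch
import torch.nn as nn


class CrossModalityCompensation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions project each modality before it compensates the other.
        self.rgb_to_depth = nn.Conv2d(channels, channels, kernel_size=1)
        self.depth_to_rgb = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor):
        # rgb_feat, depth_feat: (batch, channels, height, width) from two backbones
        rgb_out = rgb_feat + self.depth_to_rgb(depth_feat)
        depth_out = depth_feat + self.rgb_to_depth(rgb_feat)
        return rgb_out, depth_out


if __name__ == "__main__":
    rgb_feat = torch.randn(2, 64, 28, 28)
    depth_feat = torch.randn(2, 64, 28, 28)
    r, d = CrossModalityCompensation(64)(rgb_feat, depth_feat)
    print(r.shape, d.shape)  # torch.Size([2, 64, 28, 28]) for both outputs
```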