2022
DOI: 10.1109/tpami.2022.3177813

MMNet: A Model-based Multimodal Network for Human Action Recognition in RGB-D Videos

Cited by 30 publications (21 citation statements)
References 78 publications

“…
Method                           Year    cs (%)   cv (%)
STA-LSTM [54]                    2017    61.20    63.30
JCRRNN [55]                      2016    64.60    66.90
Skeleton boxes [56]              2017    82.50    84.90
Li et al [57]                    2017    86.80    94.20
HCN [58]                         2018    85.90    87.60
TSMF [59]                        2021    95.8     97.8
MMNet [60]                       2022    97.4     98.6
SAt-Object Integration (Ours)    -       98.1     98.9
…”
Section: Methods (mentioning)
confidence: 99%
“…Li et al.'s [16] model-based approach to fusion is distinct from other existing model-level fusion approaches, which require similar representations across modalities; instead, it focuses on fusion with co-learning, drawing on a thorough understanding of the data structure. Specifically, Yu et al. [25] learned a representation from the RGB modality by emphasizing body parts that offer qualities complementary to the skeleton modality, which may limit the generalizability of the results to other datasets or real-world scenarios.…”
Section: Multimodal HAR (mentioning)
confidence: 99%
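
The excerpt above describes weighting RGB features toward the body parts indicated by the skeleton modality. Below is a minimal sketch of that general idea, assuming 2D joint coordinates already scaled to the feature-map grid; the function names, shapes, and Gaussian masking are illustrative assumptions, not the implementation of MMNet or the cited papers.

```python
import numpy as np

def joint_attention_mask(joints_xy, feat_h, feat_w, sigma=2.0):
    # Sum one Gaussian bump per 2D joint, then normalize to [0, 1].
    # Illustrative only: not the attention scheme of MMNet or [25].
    ys, xs = np.mgrid[0:feat_h, 0:feat_w]
    mask = np.zeros((feat_h, feat_w), dtype=np.float32)
    for x, y in joints_xy:
        mask += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return mask / (mask.max() + 1e-8)

def skeleton_guided_pool(rgb_feat, joints_xy):
    # Weight an RGB feature map (C, H, W) by the joint mask and
    # global-average-pool it into a clip-level descriptor.
    c, h, w = rgb_feat.shape
    mask = joint_attention_mask(joints_xy, h, w)
    return (rgb_feat * mask[None, :, :]).reshape(c, -1).mean(axis=1)

# Toy usage: a 256-channel 14x14 feature map and five joints.
feat = np.random.rand(256, 14, 14).astype(np.float32)
joints = np.array([[3, 4], [7, 7], [10, 5], [6, 11], [9, 9]], dtype=np.float32)
print(skeleton_guided_pool(feat, joints).shape)  # (256,)
```
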
“…We employ view1 and view2 as the training set and view3 as the validation set on the N-UCLA dataset. The table has three parts: knowledge distillation methods [10], [24], [25], multi-modality fusion methods [4], [46], [47], [48], [49], and the proposed methods. Our FCKD(F+S+D) achieves 98.3% accuracy, outperforming the 3D ResNeXt101 baseline by 3.7%.…”
Section: E. Comparison With State-of-the-arts (mentioning)
confidence: 99%
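
The protocol quoted above (view1 and view2 for training, view3 for validation on N-UCLA) amounts to a simple split over sample identifiers. This is a sketch under the assumption that each sample's file name encodes its camera view as "_v01_", "_v02_", or "_v03_"; the naming pattern and helper are hypothetical, not the loader used in the cited work.

```python
from typing import List, Tuple

def cross_view_split(samples: List[str]) -> Tuple[List[str], List[str]]:
    # N-UCLA cross-view protocol as quoted: views 1-2 train, view 3 validates.
    # The "_v0X_" naming pattern is an assumption for illustration.
    train = [s for s in samples if "_v01_" in s or "_v02_" in s]
    val = [s for s in samples if "_v03_" in s]
    return train, val

samples = ["a01_s01_e00_v01_rgb.avi",
           "a01_s01_e00_v02_rgb.avi",
           "a01_s01_e00_v03_rgb.avi"]
train, val = cross_view_split(samples)
print(len(train), len(val))  # 2 1
```
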
“…These results demonstrate that our method can learn view-invariant semantics by focusing on the focal channel features between modalities, improving accuracy on the cross-view N-UCLA dataset. Compared to the multi-modality fusion methods [4], [46], [47], [48], [49], our FCKD(F+S+D) uses only the RGB modality in the testing phase, yet surpasses Hybrid [46], VPN [47], MMNet [48] and 3DV [4] by 5.2%, 4.8%, 4.6% and 3.0%, respectively. When we fuse FCKD(F+S+D) with unimodal optical flow, skeleton and depth, our FCKD(F+S+D)+F+S+D reaches a state-of-the-art accuracy of 98.9%, which is comparable to Hierarchical [49].…”
Section: E. Comparison With State-of-the-arts (mentioning)
confidence: 99%
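
The last excerpt fuses FCKD(F+S+D) with unimodal optical flow, skeleton and depth streams. One common realization of such multi-stream fusion is late (score-level) fusion, sketched below with equal weights; the weighting, modality names and array shapes are illustrative assumptions, not the exact scheme of FCKD, MMNet [48] or Hierarchical [49].

```python
import numpy as np

def late_fuse(scores_by_modality, weights=None):
    # Weighted average of per-modality class-score matrices of shape
    # (num_samples, num_classes), followed by an argmax over classes.
    # Illustrative late fusion, not the cited methods' exact scheme.
    names = sorted(scores_by_modality)
    if weights is None:
        weights = {m: 1.0 / len(names) for m in names}
    fused = sum(weights[m] * scores_by_modality[m] for m in names)
    return fused.argmax(axis=1)

# Toy usage: 4 samples, 10 classes, four modality streams.
rng = np.random.default_rng(0)
streams = {m: rng.random((4, 10)) for m in ["rgb", "flow", "skeleton", "depth"]}
print(late_fuse(streams))  # four predicted class indices
```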