2020
DOI: 10.3390/s20113305

A Hybrid Network for Large-Scale Action Recognition from RGB and Depth Modalities

Abstract: The paper presents a novel hybrid network for large-scale action recognition from multiple modalities. The network is built upon the proposed weighted dynamic images. It effectively leverages the strengths of the emerging Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based approaches to specifically address the challenges that occur in large-scale action recognition and are not fully dealt with by the state-of-the-art methods. Specifically, the proposed hybrid network consists of a CNN …

Citations: cited by 26 publications (16 citation statements)
References: 76 publications
“…Table 5 gives the results of multi-modality fusion-based HAR methods on the MSRDailyActivity3D [284], UTD-MHAD [355], and NTU RGB+D [195] benchmark datasets.
[360] | Feature,Score | - | - | -
c-ConvNet [58] | Feature,Score | - | - | 86.4 | 89.1
GMVAR [361] | Score | - | - | - | -
Dhiman et al [362] | Score | - | - | 79.4 | 84.1
Wang et al [363] | Feature | - | - | 89.5 | 91.7
Trear [364] | Feature | - | - | - | -
Ren et al [365] | Score | - | - | 89.7 | 93.0
CAPF [366] | Feature,Score | - | - | 94.2 | 97.3
ST-LSTM [8] | RGB,S | Feature | - | - | 73.2 | 80.6
Chain-MS [367] | Feature | - | - | 80.8 | -
GRU+STA-Hands [368] | Score | - | - | 82.5 | 88.6
Zhao et al [369] | Feature | - | - | 83.7 | 93.7
Baradel et al [370] | Score | 90.0 | - | 84.8 | 90.6
SI-MM [371] | Feature | 91.9 | - | 92.6 | 97.9
Separable STA [372] | Feature | - | - | 92.2 | 94.6
SGM-Net [373] | Score | - | - | 89.1 | 95.9
VPN (RNX3D101) [374] | Feature | - | - | 95.5 | 98.0
Luvizon et al [375] | Feature | - | - | 89.9 | -
JOLO-GCN [376] | Score | - | - | 93.8 | 98.1
TP-ViT [377] | Feature,Score | - | - | - | -
RGBPose-Conv3D [378] | Feature,Score | - | - | 97.0 | 99.6
Rahmani et al [ | RGB,S,D,PC | Score | - | - | - | -
Ardianto et al [388] | RGB,D,IR | Score | - | - | - | -
FUSION-CPA [389] | S,IR | Feature | - | - | 91.6 | 94.5
ActAR [390] | S,IR | Feature | - | - | - | -
Wang et al [391] | RGB, Au | Feature,Score | - | - | - | -
Owens et al [315] | Feature | - | - | - | -
TBN [25] | Feature | - | - | - | -
TSN+audio stream [25] | Score | - | - | - | -
AVSlowFast [317] | Feature | - | - | - | -
Gao et al [316] | Feature | - | - | - | -
MAFnet [392] | Feature | - | - | - | -
RNA-Net [393] | Feature,Score | - | - | - | -
IMD-B [394]…”
Section: Fusion; citation type: mentioning
confidence: 99%
“…In [363], a hybrid network that consists of multi-stream CNNs and 3D ConvLSTMs [414] was introduced to extract features from RGB and depth videos. These features were then fused via canonical correlation analysis to perform action classification.…”
Section: Fusion of Visual Modalities; citation type: mentioning
confidence: 99%
“…There are two widely used multi-modality fusion schemes in HAR, namely, late fusion and early fusion. Generally, late fusion [362], [363] is decision-based, which integrates the decisions that are separately made based on different modalities, to produce the final classification result. As it is usually very convenient and effective to directly fuse the classification results (confidence scores) obtained based on different modalities, late fusion has been quite popularly adopted for HAR.…”
Section: Fusion; citation type: mentioning
confidence: 99%
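
The decision-level (late) fusion described in the statement above can be illustrated with a minimal NumPy sketch. It assumes each modality has already produced a matrix of per-class confidence scores; the function name `late_fusion`, the weighting scheme, and the toy scores are hypothetical illustrations and are not taken from the cited papers.

```python
import numpy as np

def late_fusion(scores_per_modality, weights=None, rule="average"):
    """Fuse per-modality class scores, each of shape (n_samples, n_classes).

    Hypothetical helper: 'average' takes a weighted mean of the scores,
    'product' takes a weighted geometric mean. Returns fused scores and labels.
    """
    stacked = np.stack([np.asarray(s, dtype=float) for s in scores_per_modality])
    if weights is None:
        weights = np.ones(stacked.shape[0])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1
    if rule == "average":
        fused = np.tensordot(w, stacked, axes=1)  # (n_samples, n_classes)
    elif rule == "product":
        fused = np.prod(stacked ** w[:, None, None], axis=0)
    else:
        raise ValueError(f"unknown fusion rule: {rule}")
    return fused, fused.argmax(axis=1)

# Toy confidence scores for two samples and three classes (made-up numbers).
rgb_scores = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
depth_scores = np.array([[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]])

fused, labels = late_fusion([rgb_scores, depth_scores])
print(fused)
print(labels)
```

Averaging keeps a class alive when one modality is unsure, while the product rule penalises any class that a single modality scores low; which behaves better depends on how well calibrated the per-modality classifiers are.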
“…In the work of [361], a Generative Multi-View Action Recognition (GM-VAR) framework was introduced, which generated one view conditioned on the other views to make HAR more robust to cross-view settings. In the work of [363], a hybrid network of CNN and RNN was proposed, where weighted dynamic images were fed into CNNs, while RGB and depth sequences were fed into 3D ConvLSTMs to extract features. Then a canonical correlation analysis was applied to fuse these features.…”
Section: Fusion of Visual Modalities; citation type: mentioning
confidence: 99%
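
The statements above say only that features from the two branches were fused via canonical correlation analysis (CCA). As a rough sketch of what CCA-based feature fusion can look like, the NumPy code below fits linear CCA on two feature matrices and concatenates the canonical projections. The helper names `cca_fit` and `cca_fuse`, the regularisation value, the concatenation step, and the synthetic features are all assumptions for illustration and may differ from the paper's actual pipeline.

```python
import numpy as np

def _inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca_fit(X, Y, n_components, reg=1e-6):
    """Fit linear CCA on paired features X (n, dx) and Y (n, dy)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularised covariance estimates (reg is an assumed small ridge term).
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = _inv_sqrt(Sxx), _inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return mx, my, Wx, Wy, s[:n_components]

def cca_fuse(X, Y, params, mode="concat"):
    """Project both views into the canonical space and combine them."""
    mx, my, Wx, Wy, _ = params
    Zx, Zy = (X - mx) @ Wx, (Y - my) @ Wy
    return np.hstack([Zx, Zy]) if mode == "concat" else Zx + Zy

# Synthetic stand-ins for RGB and depth features sharing a 2-D latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
rgb_feat = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
depth_feat = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

params = cca_fit(rgb_feat, depth_feat, n_components=2)
fused = cca_fuse(rgb_feat, depth_feat, params)
print(fused.shape)   # fused feature matrix passed on to a classifier
print(params[4])     # canonical correlations of the kept components
```

In a sketch like this, the fused matrix would then be passed to any standard classifier; summing the two projections instead of concatenating them (`mode="sum"`) halves the fused dimensionality at the cost of mixing the two views.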
“…Recent advances in artificial intelligence technology and vision sensors have promoted vision-based action recognition for various applications, such as education [2], entertainment [3, 4], and sports [5, 6, 7, 8, 9, 10, 11, 12]. Various studies have proposed novel algorithms [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] or established datasets [1, 24, 25, 26, 27] for vision-based action recognition.…”
Section: Introduction; citation type: mentioning
confidence: 99%