2023
DOI: 10.3390/math11092115
SlowFast Multimodality Compensation Fusion Swin Transformer Networks for RGB-D Action Recognition

Abstract: RGB-D-based technology combines the advantages of RGB and depth sequences, which can effectively recognize human actions in different environments. However, it is difficult for different modalities to learn spatio-temporal information from each other effectively. To enhance the information exchange between modalities, we introduce a SlowFast multimodality compensation block (SFMCB), which is designed to extract compensation features. Concretely, the SFMCB fuses features from two independent pathwa…
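The abstract describes two independent pathways whose features are fused by a compensation block. A minimal sketch of that general idea, assuming a SlowFast-style design (a slowly sampled stream and a densely sampled stream of the same clip, with each pathway borrowing the other's pooled descriptor); all names and shapes here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

# Hedged sketch of a SlowFast-style two-pathway compensation fusion.
# slow_stride/fast_stride and the additive "compensation" are assumptions
# for illustration, not the SFMCB's real design.

def sample_pathways(clip, slow_stride=8, fast_stride=2):
    """Subsample one clip of per-frame features (T, C) into two streams."""
    slow = clip[::slow_stride]   # few frames: semantic pathway
    fast = clip[::fast_stride]   # many frames: motion pathway
    return slow, fast

def compensation_fuse(slow_feat, fast_feat):
    """Toy compensation: pool each pathway over time, then let one
    pathway's descriptor be compensated by the other's (additive)."""
    slow_vec = slow_feat.mean(axis=0)
    fast_vec = fast_feat.mean(axis=0)
    return np.concatenate([slow_vec + fast_vec, fast_vec])

clip = np.random.rand(32, 16)        # 32 frames, 16-dim features per frame
slow, fast = sample_pathways(clip)
fused = compensation_fuse(slow, fast)
print(slow.shape, fast.shape, fused.shape)  # (4, 16) (16, 16) (32,)
```

The point of the sketch is only the data flow: one clip, two temporal sampling rates, and a fusion step in which each pathway receives information from the other.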


Cited by 5 publications (4 citation statements)
References 51 publications
“…
| Fusion method | Fusion inputs | Encoders | Datasets |
| … | … | … | …UTKinect-Action3D [63], CAD-60 [64,65], LIRIS human activities |
| Product score fusion (late) [34] | Visual (RGB) and depth dynamic images | c-ConvNet | ChaLearn LAP IsoGD [61], NTU RGB+D [53] |
| Fusion score (late) [33] | RGB and depth frames | Pre-trained VGG networks | MSRDaily Activity 3D [49], UTD-MHAD [51], CAD-60 [64,65] |
| Early, intermediate, and late fusion [13] | RGB, depth, skeleton | I3D and Shift-GCN | NTU RGB+D [53], SBU Interaction [66] |
| Cross-modality compensation block (intermediate) [42] | RGB and depth | ResNet and VGG with CMCB | NTU RGB+D 120 [56], THU-READ [58], PKU-MMD [67] |
| SlowFast multimodality compensation block (intermediate) [40] | RGB and depth features | Swin Transformer | NTU RGB+D 120 [56], NTU RGB+D 60 [53], THU-READ [58], PKU-MMD [67] |
| Cross-modality fusion transformer (intermediate) [41] | RGB and depth features | ResNet50 feature extractors | NTU RGB+D 120 [56], THU-READ [58], PKU-MMD [67] |
…”
Section: Fusion Methods, Fusion Inputs, Encoders, Datasets
Citation type: mentioning (confidence: 99%)
“…The authors in [39] propose a Transformer-based framework for egocentric action recognition, emphasizing inter-frame and mutual-attentional fusion. The authors in [40] contrast previous methods by integrating a SlowFast multimodality compensation block for processing depth and RGB sequences. The authors in [41] propose a dual-stream cross-modality fusion transformer, enhancing feature representations through self-attention and cross-attention mechanisms.…”
Section: Deep Learning-based Solutions
Citation type: mentioning (confidence: 99%)
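The citation above mentions self-attention and cross-attention for fusing dual-stream features. A minimal sketch of the cross-attention mechanism it refers to, in which one modality supplies the queries and the other the keys/values so each stream can attend to the other; token counts, dimensions, and names are illustrative assumptions, not taken from any cited implementation:

```python
import numpy as np

# Hedged sketch of single-head cross-modality attention (no learned
# projections, for clarity): modality A queries modality B.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """q_feats: (Tq, d) tokens from modality A; kv_feats: (Tk, d) from B."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (Tq, Tk) affinities
    weights = softmax(scores, axis=-1)           # each A-token attends over B
    return weights @ kv_feats                    # (Tq, d) B-informed features

rgb = np.random.rand(8, 32)     # 8 RGB tokens, 32-dim
depth = np.random.rand(6, 32)   # 6 depth tokens, 32-dim
fused_rgb = cross_attention(rgb, depth)  # RGB stream attends to depth
print(fused_rgb.shape)  # (8, 32)
```

In a dual-stream design the same operation would also be applied in the opposite direction (depth queries attending to RGB), alongside per-stream self-attention.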
“…• SlowFast (SF) (Xiao et al. 2020; Feichtenhofer et al. 2019) is a publicly available video-audio action recognition baseline.…”
Section: Experiments Details
Citation type: mentioning (confidence: 99%)
“…Many existing works focus on mining domain-invariant features and thus blocking the impact of domain-variant features that hinder the generalization across domains. However, they are vulnerable to (Xiao et al. 2020; Feichtenhofer et al. 2019) in intra-domain and DG settings with Advertisement, Cartoon, Game, and Movie domains. r and e denote train and test sets.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)