2022 22nd International Conference on Control, Automation and Systems (ICCAS)
DOI: 10.23919/iccas55662.2022.10003914

Multi-modal Transformer for Indoor Human Action Recognition

Cited by 2 publications (3 citation statements)
References: 26 publications
“…As another early fusion method, early concatenation entails concatenating token embedding sequences from multiple modalities and inputting them into transformer layers. Studies in [36], [37], and [42] employed this fusion method with RGB and skeleton sequences for HAR. Early concatenation offers the benefit of relative simplicity compared to other methods.…”
Section: B. Multimodal Fusion for HAR, 1) Transformer-Based Multimodali... (citation type: mentioning)
Confidence: 99%
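As an informal illustration of the early-concatenation fusion described in this statement, the sketch below (PyTorch; module names, dimensions, and the class count are illustrative assumptions, not taken from the cited papers) concatenates RGB and skeleton token embeddings along the sequence axis and feeds the joint sequence through shared transformer encoder layers, so attention is computed across both modalities at once.

# Minimal PyTorch sketch of early-concatenation fusion (illustrative only):
# token embeddings from two modalities are concatenated along the sequence
# axis and processed jointly by standard transformer encoder layers.
import torch
import torch.nn as nn


class EarlyConcatFusion(nn.Module):
    def __init__(self, dim=256, num_layers=4, num_heads=8, num_classes=60):
        super().__init__()
        # learned embeddings that tag each token with its modality
        self.rgb_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.skel_type = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb_tokens, skel_tokens):
        # rgb_tokens:  (B, N_rgb, dim)  e.g. patch/clip embeddings
        # skel_tokens: (B, N_skel, dim) e.g. per-joint embeddings
        tokens = torch.cat(
            [rgb_tokens + self.rgb_type, skel_tokens + self.skel_type], dim=1
        )
        fused = self.encoder(tokens)          # joint cross-modal attention
        return self.head(fused.mean(dim=1))   # pooled action logits


# Usage with dummy inputs (batch of 2, 64 RGB tokens, 25 joints x 8 frames):
logits = EarlyConcatFusion()(torch.randn(2, 64, 256), torch.randn(2, 200, 256))

Because a single encoder sees both token streams, no modality-specific cross-attention blocks are required, which is the relative simplicity the statement refers to.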
“…1) Recognition of fine-grained HOI: Most existing HAR studies [12], [13], [36], [37], [42] that fuse skeleton and RGB modalities have mainly concentrated on recognizing broad interaction categories. These studies have assessed datasets like NTU RGB+D [27], which contain only coarse-grained HOI that can be accurately classified using high-quality skeleton modality features alone.…”
Section: Introduction (citation type: mentioning)
Confidence: 99%
“…The video backbone consists of 3D CNNs to extract spatio-temporal features, and the pose backbone contains a spatio-temporal GCN. Do et al. (2022) propose a Multimodal Transformer (MMT) to use RGB and skeleton data of only eight input frames. Using the transformer-based structure, MMT can capture the correlation between non-local joints in the skeleton data modality.…”
Section: Single-Stream: Various Papers Introduce Skeleton as Attention... (citation type: mentioning)
Confidence: 99%
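The sketch below (again illustrative PyTorch, not the published MMT code; the 8-frame, 25-joint layout mirrors the statement but the embedding size and head count are assumptions) shows why self-attention over per-joint tokens can capture correlations between non-local joints: the attention weights form a full (T·J) × (T·J) matrix, so joints far apart on the kinematic tree attend to each other directly, unlike a GCN that propagates information only along skeleton edges.

# Illustrative sketch: self-attention over skeleton joint tokens lets every
# joint attend to every other joint, so correlations between non-adjacent
# (non-local) joints are modeled directly.
import torch
import torch.nn as nn

B, T, J, C = 2, 8, 25, 3          # batch, 8 input frames, 25 joints, xyz
dim = 128

joints = torch.randn(B, T, J, C)                           # raw skeleton sequence
tokens = nn.Linear(C, dim)(joints).reshape(B, T * J, dim)  # one token per joint per frame

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
out, weights = attn(tokens, tokens, tokens)                # weights: (B, T*J, T*J)

# weights[b, i, j] is the attention from joint-token i to joint-token j,
# including pairs of joints that are far apart on the kinematic tree.
print(out.shape, weights.shape)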