2022
DOI: 10.1007/s00530-022-00961-3

Multi-head attention-based two-stream EfficientNet for action recognition

Abstract: Recent years have witnessed the popularity of using two-stream convolutional neural networks for action recognition. However, existing two-stream convolutional neural network-based action recognition approaches are incapable of distinguishing some roughly similar actions in videos such as sneezing and yawning. To solve this problem, we propose a Multi-head Attention-based Two-stream EfficientNet (MAT-EffNet) for action recognition, which can take advantage of the efficient feature extraction of EfficientNet. T…


Cited by 23 publications (7 citation statements)
References 72 publications
“…
Method | Year | Accuracy (%)
Semi-supervised temporal gradient learning [57] | 2022 | 75.9
BS-2SCN [41] | 2022 | 71.3
ViT + Multi Layer LSTM [45] | 2022 | 73.7
MAT-EffNet [58] | 2023 | 70.9
DA-R3DCNN (Proposed) | 2023 | 82.5
…”
Section: Methods, Year, Accuracy (%)
confidence: 99%
“…Among the comparative methods, Multi-task hierarchical clustering [33] achieved the lowest accuracy of 51.4% on the HMDB51 dataset. Other comparative methods included STPP + LSTM [46], TSN [48], Deep autoencoder [35], TS-LSTM + temporal-inception [50], HATNet [51], Correlational CNN + LSTM [52], STDAN [53], DB-LSTM + SSPF [54], DS-GRU [39], TCLC [55], Semi-supervised temporal gradient learning [57], BS-2SCN [41], ViT + Multi Layer LSTM [45], and MAT-EffNet [58]. These methods achieved accuracies of 70.5%, 72.2%, 70.7%, 58.6%, 70.3%, 69.0%, 74.8%, 66.2%, 56.5%, 75.1%, 72.3%, 71.5%, 75.9%, 71.3%, 73.7%, and 70.9%, respectively.…”
Section: Methods, Year, Accuracy (%)
confidence: 99%
“…It extracts static features from partial facial images and dynamic features from partial facial optical flow, feeding them into a dual-stream neural network for feature fusion and classification, showcasing excellent performance. Zhou et al [46] harnessed EfficientNet's efficient feature extraction abilities to separately extract spatial and temporal features of consecutive video frames from spatial and temporal flows. They then employed a multi-head attention mechanism to capture pivotal action details, facilitating action recognition using the amalgamated features.…”
Section: Deep Multi-path Network
confidence: 99%
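The citing summary above describes MAT-EffNet's key step: multi-head attention applied over spatial and temporal features extracted by EfficientNet to capture pivotal action details before fusion. The following is a minimal NumPy sketch of that fusion pattern only, not the authors' implementation; the feature sizes, head count, token layout, and the final average-pooling step are all illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention computed independently per head.

    q, k, v: (seq_len, d_model) arrays; d_model must divide by num_heads.
    (Real implementations also learn Q/K/V/output projections, omitted here.)
    """
    seq_len, d_model = q.shape
    d_head = d_model // num_heads
    out = np.empty_like(q)
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, s] @ k[:, s].T / np.sqrt(d_head)   # (seq_len, seq_len)
        out[:, s] = softmax(scores, axis=-1) @ v[:, s]   # per-head output
    return out

# Toy fusion: stack per-frame features from the two streams into one token
# sequence, self-attend over it, then average-pool into a clip descriptor.
rng = np.random.default_rng(0)
T, d = 4, 8                              # 4 frames, 8-dim features (toy sizes)
spatial = rng.standard_normal((T, d))    # stand-in for RGB-stream features
temporal = rng.standard_normal((T, d))   # stand-in for optical-flow features
tokens = np.concatenate([spatial, temporal], axis=0)   # (2T, d)
fused = multi_head_attention(tokens, tokens, tokens, num_heads=2)
clip_descriptor = fused.mean(axis=0)     # (d,) vector fed to a classifier
```

The per-head slicing mirrors how multi-head attention partitions the model dimension so each head can attend to a different subspace of the spatial and temporal features.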
“…The two-stream convolutional network [16] was proposed, with two feature extraction networks: one operating on optical flow and the other on RGB frames. Alongside CNNs, a number of stronger networks appeared, such as TSN [17], TSM [18], and R(2+1)D [19], among others [20,21]. Although 2D-CNNs have fewer model parameters and high accuracy on image data, they cannot capture the temporal information in video data.…”
Section: Related Work
confidence: 99%
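As the passage notes, the classic two-stream design runs one network on RGB frames and another on optical flow, then combines their predictions. A toy sketch of that late-fusion step, assuming weighted averaging of per-stream softmax scores (the logits and the 50/50 weighting are made-up illustrative values, not from any cited model):

```python
import numpy as np

def late_fuse(rgb_logits, flow_logits, w_rgb=0.5):
    """Weighted average of the two streams' class probabilities."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    return w_rgb * softmax(rgb_logits) + (1.0 - w_rgb) * softmax(flow_logits)

rgb = np.array([2.0, 0.5, 0.1])    # toy logits from the RGB (appearance) stream
flow = np.array([0.3, 1.8, 0.2])   # toy logits from the optical-flow (motion) stream
probs = late_fuse(rgb, flow)       # fused class probabilities
pred = int(np.argmax(probs))       # predicted action class index
```

Averaging probabilities rather than logits keeps each stream's contribution bounded, which is one common choice when the two streams are trained separately.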