2019
DOI: 10.20944/preprints201912.0086.v1
Preprint

Semi-CNN Architecture for Effective Spatio-Temporal Learning in Action Recognition

Abstract: This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal features in video action recognition. Unlike 2D CNNs, 3D CNNs can be applied directly on consecutive frames to extract spatio-temporal features. The aim of this work is to fuse the convolution layers from 2D and 3D CNNs to allow temporal encoding with fewer parameters than 3D CNNs. We adopt transfer learning from pre-trained 2D CNNs for spatial extraction, followed by temporal encoding, before connecting to 3D c…
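
The abstract's claim that fusing 2D and 3D convolution layers needs fewer parameters than a pure 3D CNN can be illustrated with simple arithmetic. The sketch below counts the trainable parameters of a single convolution layer from its kernel shape. The channel widths (64 in, 64 out) and the kernel size of 3 are illustrative assumptions, not values taken from the paper, and `conv_params` is a hypothetical helper written for this example.

```python
def conv_params(c_in, c_out, kernel, bias=True):
    """Count trainable parameters in one convolution layer.

    kernel is a tuple of kernel extents, e.g. (3, 3) for a 2D
    convolution or (3, 3, 3) for a 3D convolution.
    """
    weights = c_in * c_out
    for k in kernel:
        weights *= k
    # One bias term per output channel, if biases are used.
    return weights + (c_out if bias else 0)


# Assumed layer sizes, chosen only for illustration.
p2d = conv_params(64, 64, (3, 3))      # spatial kernel only
p3d = conv_params(64, 64, (3, 3, 3))   # spatial + temporal kernel

print(p2d)  # 36928
print(p3d)  # 110656
```

With these assumed sizes, the 3D layer carries roughly three times the parameters of its 2D counterpart, because the temporal extent of the kernel multiplies the weight count. Replacing some 3D layers with 2D ones is therefore one plausible way a fused design could reduce the total.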

Cited by 7 publications (8 citation statements)
References 7 publications

“…Inspired by transfer learning, Stroud et al [150] proposed a Distilled 3D Network (D3D), consisting of a student network and a teacher network, where the student network is trained from RGB videos and also distilling knowledge from the teacher network trained from optical flow sequences. Besides, 2D and 3D CNNs are fused for HAR in [158].…”
Section: Deep Learning Methods (mentioning)
confidence: 99%
“…[3]. Besides, a ResNet18 model used in previous studies has a considerably higher number of trainable parameters (11 M [30]) than our proposed models.…”
Section: Models Evaluation (mentioning)
confidence: 99%
“…For efficient learning of spatio-temporal features in video action recognition, a hybrid CNN was introduced in [174] used a fusion convolutional architecture. 2D and 3D CNN was fused to present temporal encoding with fewer parameters.…”
Section: Deep Learning Techniques Based Models (mentioning)
confidence: 99%
“…The performance of action recognition models as mentioned in [174]. Including: (a) Semi-CNN model based on VGG16 architecture (b) Semi-CNN model based on ResNet34 architecture (c) Semi-CNN model based on DenseNet121 architecture.…”
Section: Figure (mentioning)
confidence: 99%