2020
DOI: 10.1109/access.2020.2983427
Action Recognition in Videos Using Pre-Trained 2D Convolutional Neural Networks

Abstract: A pre-trained 2D CNN (Convolutional Neural Network) can be used for the spatial stream in the two-stream CNN structure for videos, treating the representative frame selected from the video as an input. However, the CNN for the temporal stream in the two-stream CNN needs training from scratch using the optical flow frames, which demands expensive computations. In this paper, we propose to adopt a pre-trained 2D CNN for the temporal stream to avoid the optical flow computations. Specifically, three RGB frames se…
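To make the two-stream setup in the abstract concrete, here is a minimal sketch of the spatial stream: a pre-trained 2D CNN applied to a representative frame. The ResNet-50 backbone and the middle-frame selection heuristic are assumptions for illustration, not the paper's choices.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained 2D CNN reused as the spatial-stream backbone (illustrative choice).
spatial_cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
spatial_cnn.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def spatial_stream_logits(video_frames):
    """video_frames: list of HxWx3 uint8 RGB arrays from one video."""
    frame = video_frames[len(video_frames) // 2]  # representative frame (heuristic)
    x = preprocess(frame).unsqueeze(0)            # 1 x 3 x 224 x 224
    with torch.no_grad():
        return spatial_cnn(x)                     # class logits
```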

Cited by 25 publications (42 citation statements) | References 26 publications
“…The first 3D-CNN for HAR was introduced by [36][37][38], providing an average accuracy of 91 percent. Recent research based on 3D-CNN techniques [39][40][41][42] has achieved high performance on the KTH dataset [43] in comparison to 2D-CNN networks [44][45][46][47]. Yet, although the maximum accuracy reported in this line of work is 98.5 percent, these models are not capable of classifying in real time.…”
Section: Human Action Recognition (HAR)
confidence: 78%
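For orientation, a hedged sketch of the distinction this statement alludes to: a 2D convolution slides over a single frame, while a 3D convolution slides over a stack of frames, adding a temporal dimension. The layer sizes below are illustrative, not taken from any cited model.

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 3, 16, 112, 112)  # batch x channels x time x H x W

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)          # spatial only
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)  # spatiotemporal

per_frame = conv2d(frames[:, :, 0])  # one frame:  1 x 64 x 112 x 112
clip_feat = conv3d(frames)           # whole clip: 1 x 64 x 16 x 112 x 112
```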
“…For instance, [34] developed a long-term recurrent neural network using a deep hierarchical feature extractor with LSTM networks to synthesize temporal dynamics for visual recognition and description; [13] learned video representations using neural networks with long-term temporal convolutions to model actions at their full temporal extent; [35] adaptively identified key features of actions in videos for every time-step prediction of the RNN by reinforcing the LSTM with a spatial-temporal attention module; [7] proposed an attention-based bidirectional LSTM method for video analysis. Moreover, Wang et al. [36] modeled long-range temporal structure with a segment-based sampling and aggregation strategy; Kim et al. [2] employed a stacked grayscale 3-channel image to fine-tune a pre-trained 2D CNN for the temporal stream in videos. Furthermore, there have been successful attempts to directly apply 3D convolutional networks to action recognition, since 3D filters can learn spatiotemporal representations from raw videos [14], [37]-[39].…”
Section: Related Work
confidence: 99%
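As a rough illustration of the segment-based sampling mentioned above (in the spirit of [36], not its exact procedure): the video is divided into equal-length segments and one frame is drawn per segment, so long-range structure is covered at a fixed cost.

```python
import random

def sample_segment_frames(num_frames, num_segments, train=True):
    """Pick one frame index per equal-length segment (TSN-style sketch)."""
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(int((s + 1) * seg_len), start + 1)
        # Random within the segment for training; segment center for testing.
        indices.append(random.randrange(start, end) if train
                       else (start + end - 1) // 2)
    return indices
```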
“…Parameters are set as suggested in the original papers or GitHub repositories, i.e., learning rate 0.001, momentum 0.9, learning patience 10, 5, and 10, respectively, learning rate decay 0.1, weight decay 0.001, 0.0001, and 0.001, respectively, sample duration 16, and the backbone network ResNet152, except for C3D. Moreover, the recently proposed ABi-LSTM [7], SG3I [2], and the compressed-domain method CoViAR [5], originally used for action recognition, are introduced for comparison; their parameters are set to the defaults indicated in their papers and GitHub repositories. For our RCCN method, we use temporal segments to capture variable-length dependencies among frames: the segment size is 5 during training and 25 in testing; the other parameters are shown in Table 4.…”
Section: State-of-the-art Alternatives Comparison
confidence: 99%
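A minimal sketch of how such a training configuration might look in PyTorch, assuming SGD with a plateau-based decay schedule; the values mirror the quoted settings (learning rate 0.001, momentum 0.9, decay factor 0.1, patience 10, weight decay 0.001), but the surrounding code is illustrative rather than the comparison's actual script.

```python
import torch
import torchvision.models as models

# Placeholder backbone; the comparison used ResNet152 for most methods.
model = models.resnet152(weights=None)

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001, momentum=0.9, weight_decay=0.001)
# Decay the learning rate by a factor of 0.1 when the monitored metric
# plateaus for 10 epochs (patience values of 5 or 10 were used per method).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=10)

# In the training loop, call scheduler.step(val_loss) after each validation pass.
```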
“…However, the size of the I3D input is still ℝ^{c×d×d×T}, which makes no change in the inference complexity. Recently, in [10], it has been shown that video recognition can be done using pre-trained 2D CNNs only. That is, a pre-trained CNN is fine-tuned with 3 grayscale frames that are subsampled from a video shot.…”
Section: Introduction
confidence: 99%
“…That is, a pre-trained CNN is fine-tuned with 3 grayscale frames that are subsampled from a video shot. The 3 grayscale images selected from among the video frames then form an SG3I (Stacked Grayscale 3-channel Image) [10], which is compatible with a color image with RGB (Red, Green, Blue) channels. The SG3Is formed from the training videos are then used to fine-tune the pre-trained 2D CNN to learn the motion information.…”
Section: Introduction
confidence: 99%
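To make the SG3I idea concrete, here is a minimal sketch of forming a stacked grayscale 3-channel image from a video shot, assuming OpenCV for the grayscale conversion. The first/middle/last subsampling is an illustrative assumption, not necessarily how [10] selects the three frames.

```python
import cv2
import numpy as np

def make_sg3i(frames):
    """Stack 3 grayscale frames into the channels of one 3-channel image.

    frames: list of HxWx3 uint8 BGR frames from a single video shot.
    The first/middle/last subsampling below is an illustrative choice.
    """
    idx = [0, len(frames) // 2, len(frames) - 1]
    grays = [cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY) for i in idx]
    # HxWx3 result is shape-compatible with the RGB input of a 2D CNN,
    # so the pre-trained network can be fine-tuned on it directly.
    return np.stack(grays, axis=-1)
```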