2020
DOI: 10.1109/access.2020.2983427
Action Recognition in Videos Using Pre-Trained 2D Convolutional Neural Networks

Abstract: A pre-trained 2D CNN (Convolutional Neural Network) can be used for the spatial stream in the two-stream CNN structure for videos, treating the representative frame selected from the video as an input. However, the CNN for the temporal stream in the two-stream CNN needs training from scratch using the optical flow frames, which demands expensive computations. In this paper, we propose to adopt a pre-trained 2D CNN for the temporal stream to avoid the optical flow computations. Specifically, three RGB frames se…
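To make the two-stream setup in the abstract concrete, here is a minimal sketch of the spatial stream: a pre-trained 2D CNN applied to a representative frame. The ResNet-50 backbone and the middle-frame selection heuristic are assumptions for illustration, not the paper's choices.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained 2D CNN reused as the spatial-stream backbone (illustrative choice).
spatial_cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
spatial_cnn.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def spatial_stream_logits(video_frames):
    """video_frames: list of HxWx3 uint8 RGB arrays from one video."""
    frame = video_frames[len(video_frames) // 2]  # representative frame (heuristic)
    x = preprocess(frame).unsqueeze(0)            # 1 x 3 x 224 x 224
    with torch.no_grad():
        return spatial_cnn(x)                     # class logits
```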

Cited by 25 publications (42 citation statements) | References 26 publications
“…The first 3D-CNN for HAR was introduced by [36][37][38], providing an average accuracy of 91 percent. Recent research based on 3D-CNN techniques [39][40][41][42] has achieved high performance on the KTH dataset [43] in comparison to 2D-CNN networks [44][45][46][47]. Yet, although the maximum accuracy reported in this line of work is 98.5 percent, these models are not capable of classifying in real time.…”
Section: Human Action Recognition (HAR)
confidence: 78%
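For orientation, a hedged sketch of the distinction this statement alludes to: a 2D convolution slides over a single frame, while a 3D convolution slides over a stack of frames, adding a temporal dimension. The layer sizes below are illustrative, not taken from any cited model.

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 3, 16, 112, 112)  # batch x channels x time x H x W

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)          # spatial only
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)  # spatiotemporal

per_frame = conv2d(frames[:, :, 0])  # one frame:  1 x 64 x 112 x 112
clip_feat = conv3d(frames)           # whole clip: 1 x 64 x 16 x 112 x 112
```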
“…For instance, [34] developed a long-term recurrent neural network using a deep hierarchical feature extractor with LSTM networks to synthesize temporal dynamics for visual recognition and description; [13] learned video representations using neural networks with long-term temporal convolutions to model actions at their full temporal extent; [35] adaptively identified key features of actions in videos for every time-step prediction of the RNN by reinforcing the LSTM with a spatial-temporal attention module; [7] proposed an attention-based bidirectional LSTM method for video analysis. Moreover, Wang et al. [36] modeled long-range temporal structure with a segment-based sampling and aggregation strategy; Kim et al. [2] employed a stacked grayscale 3-channel image to fine-tune a pre-trained 2D CNN for the temporal stream in videos. Furthermore, there have been successful attempts to directly apply 3D convolutional networks to action recognition, since 3D filters can learn spatiotemporal representations from raw videos [14], [37]-[39].…”
Section: Related Work
confidence: 99%
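As a rough illustration of the segment-based sampling mentioned above (in the spirit of [36], not its exact procedure): the video is divided into equal-length segments and one frame is drawn per segment, so long-range structure is covered at a fixed cost.

```python
import random

def sample_segment_frames(num_frames, num_segments, train=True):
    """Pick one frame index per equal-length segment (TSN-style sketch)."""
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(int((s + 1) * seg_len), start + 1)
        # Random within the segment for training; segment center for testing.
        indices.append(random.randrange(start, end) if train
                       else (start + end - 1) // 2)
    return indices
```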
“…Parameters are set as suggested in the original papers or GitHub repositories, i.e., learning rate 0.001, momentum 0.9, learning patience 10, 5, and 10, respectively, learning rate decay 0.1, weight decay 0.001, 0.0001, and 0.001, respectively, sample duration 16, and the backbone network ResNet152, except for C3D. Moreover, the recently proposed ABi-LSTM [7], SG3I [2], and the compressed-domain method CoViAR [5], originally used for action recognition, are introduced for comparison; their parameters are set to the defaults indicated in their papers and GitHub repositories. For our RCCN method, we use temporal segments to capture variable-length dependencies among frames: the segment size is 5 during training and 25 in testing; the other parameters are shown in Table 4.…”
Section: State-of-the-art Alternatives Comparison
confidence: 99%
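A minimal sketch of how such a training configuration might look in PyTorch, assuming SGD with a plateau-based decay schedule; the values mirror the quoted settings (learning rate 0.001, momentum 0.9, decay factor 0.1, patience 10, weight decay 0.001), but the surrounding code is illustrative rather than the comparison's actual script.

```python
import torch
import torchvision.models as models

# Placeholder backbone; the comparison used ResNet152 for most methods.
model = models.resnet152(weights=None)

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001, momentum=0.9, weight_decay=0.001)
# Decay the learning rate by a factor of 0.1 when the monitored metric
# plateaus for 10 epochs (patience values of 5 or 10 were used per method).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=10)

# In the training loop, call scheduler.step(val_loss) after each validation pass.
```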
“…However, the size of the I3D input is still ℝ^{c×d×d×T}, which makes no change in the inference complexity. Recently, in [10], it has been shown that video recognition can be done using pre-trained 2D CNNs only. That is, a pre-trained CNN is fine-tuned with 3 grayscale frames that are subsampled from a video shot.…”
Section: Introduction
confidence: 99%
“…That is, a pre-trained CNN is fine-tuned with 3 grayscale frames that are subsampled from a video shot. The 3 grayscale images selected from among the video frames then form an SG3I (Stacked Grayscale 3-channel Image) [10], which is compatible with a color image with RGB (Red, Green, Blue) channels. The SG3Is formed from the training videos are then used to fine-tune the pre-trained 2D CNN to learn the motion information.…”
Section: Introduction
confidence: 99%
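To make the SG3I idea concrete, here is a minimal sketch of forming a stacked grayscale 3-channel image from a video shot, assuming OpenCV for the grayscale conversion. The first/middle/last subsampling is an illustrative assumption, not necessarily how [10] selects the three frames.

```python
import cv2
import numpy as np

def make_sg3i(frames):
    """Stack 3 grayscale frames into the channels of one 3-channel image.

    frames: list of HxWx3 uint8 BGR frames from a single video shot.
    The first/middle/last subsampling below is an illustrative choice.
    """
    idx = [0, len(frames) // 2, len(frames) - 1]
    grays = [cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY) for i in idx]
    # HxWx3 result is shape-compatible with the RGB input of a 2D CNN,
    # so the pre-trained network can be fine-tuned on it directly.
    return np.stack(grays, axis=-1)
```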