Hand gestures are a natural form of communication in human-computer interaction; however, when gestures are captured on video, extracting features for classification is complex. Current machine learning models struggle to achieve high accuracy on videos recorded in realistic environments. In this work, we propose a hybrid architecture that places a recurrent neural network (RNN) with a long short-term memory (LSTM) layer on top of a convolutional neural network (CNN) to recognize dynamic hand gestures recorded in realistic environments. We used a dataset of six dynamic hand gestures: scroll-left, scroll-right, scroll-up, scroll-down, zoom-in, and zoom-out. Our Inception-v3 model extracts per-frame features, and the resulting sequence of frame-feature vectors is passed to the RNN, which performs the final classification. The proposed model classifies gestures with an average accuracy of 83.66%. In doing so, we aim to narrow the accuracy gap that arises when gestures are recorded in realistic rather than controlled environments. Finally, we compare the accuracy of our proposed dynamic hand gesture recognition model with that of the benchmark.
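
As a concrete illustration of the described architecture, the Keras sketch below wires a frozen Inception-v3 feature extractor to an LSTM classifier over the six gesture classes. The clip length, LSTM hidden size, dropout rate, and training configuration are illustrative assumptions; the abstract does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6   # scroll-left/right/up/down, zoom-in, zoom-out
SEQ_LEN = 30      # frames per clip (assumed; not specified in the paper)

# Frozen Inception-v3 backbone used purely as a per-frame feature extractor.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(299, 299, 3))
backbone.trainable = False

# Wrap the CNN so it runs on every frame of a (SEQ_LEN, 299, 299, 3) clip,
# producing a (SEQ_LEN, 2048) sequence of frame-feature vectors.
clip_input = layers.Input(shape=(SEQ_LEN, 299, 299, 3))
frame_features = layers.TimeDistributed(backbone)(clip_input)

# The LSTM consumes the feature sequence; a softmax head emits the gesture class.
x = layers.LSTM(256)(frame_features)   # hidden size is an assumption
x = layers.Dropout(0.5)(x)
output = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(clip_input, output)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Freezing the backbone reflects the two-stage design implied by the abstract, in which the CNN serves only as a fixed feature extractor and the temporal classifier is trained on the extracted frame-feature sequences.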