Image data contain only spatial information, making two-dimensional (2D) Convolutional Neural Networks (CNNs) ideal for image classification. Video data, on the other hand, contain both spatial and temporal information that must be analyzed simultaneously to solve action recognition problems. 3D CNNs are used successfully for these tasks, but they suffer from a large inherent parameter set. Increasing network depth, as is common among 2D CNNs, and hence the number of trainable parameters, does not provide a good trade-off between accuracy and complexity for 3D CNNs. In this work, we propose the Pooling Block (PB), an enhanced pooling operation for optimizing action recognition with 3D CNNs. PB comprises three kernels of different sizes. The three kernels sub-sample feature maps simultaneously, and their outputs are concatenated into a single output vector. We compare our approach against three benchmark 3D CNNs (C3D, I3D, and Asymmetric 3D CNN) on three datasets (HMDB51, UCF101, and Kinetics-400). Our PB method yields significant improvement in 3D CNN performance with a comparatively small increase in the number of trainable parameters. We further investigate (1) the effect of video frame dimensions and (2) the effect of the number of video frames on the performance of 3D CNNs, using C3D as the benchmark.

INDEX TERMS Action recognition, convolutional neural network, optimization.
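A minimal sketch of the pooling-block idea described in the abstract above, assuming PyTorch, max pooling, kernel sizes of 2, 3, and 5, and concatenation along the channel axis; the paper's exact configuration may differ.

```python
# Three 3D pooling branches of different kernel sizes sub-sample the same
# feature map in parallel; their outputs are concatenated. Kernel sizes,
# the use of max pooling, and the channel-wise concatenation are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class PoolingBlock(nn.Module):
    def __init__(self):
        super().__init__()
        # Strides and paddings are chosen so every branch halves the
        # spatio-temporal resolution, allowing the outputs to be concatenated.
        self.pool_small = nn.MaxPool3d(kernel_size=2, stride=2)
        self.pool_medium = nn.MaxPool3d(kernel_size=3, stride=2, padding=1)
        self.pool_large = nn.MaxPool3d(kernel_size=5, stride=2, padding=2)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        outputs = [self.pool_small(x), self.pool_medium(x), self.pool_large(x)]
        # Concatenate the three sub-sampled maps along the channel dimension.
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    feature_map = torch.randn(1, 64, 16, 56, 56)
    pooled = PoolingBlock()(feature_map)
    print(pooled.shape)  # torch.Size([1, 192, 8, 28, 28])
```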
Video description refers to understanding visual content and transforming that understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing and has real-time, practical applications. Deep learning-based approaches to video description have demonstrated enhanced results compared to conventional approaches. The current literature lacks a thorough interpretation of the recently developed sequence-to-sequence techniques for video description. This paper fills that gap by focusing mainly on deep learning-enabled approaches to automatic caption generation. Sequence-to-sequence models follow an Encoder–Decoder architecture, employing some composition of a CNN and an RNN (or its LSTM or GRU variants) as the encoder and decoder blocks. This standard architecture can be fused with an attention mechanism that focuses on the most salient features, achieving high-quality results. Reinforcement learning employed within the Encoder–Decoder structure can progressively deliver state-of-the-art captions by following exploration and exploitation strategies. The transformer is a modern and efficient transduction architecture for robust output. Free from recurrence and based solely on self-attention, it allows parallelization and training on massive amounts of data, fully utilizing available GPUs for most NLP tasks. With the recent emergence of several transformer variants, handling long-term dependencies is no longer an obstacle for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes; they can draw promising directions from this survey.
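As a concrete illustration of the Encoder–Decoder composition the survey describes, the following minimal sketch assumes PyTorch, GRU encoder and decoder blocks, precomputed per-frame CNN features, and teacher forcing; the feature dimension, hidden size, and vocabulary size are illustrative, and attention is omitted for brevity.

```python
# A recurrent encoder summarizes a sequence of per-frame CNN features; a
# recurrent decoder, conditioned on that summary, emits caption tokens.
import torch
import torch.nn as nn


class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, caption_tokens):
        # frame_feats: (batch, num_frames, feat_dim) from a pretrained CNN
        # caption_tokens: (batch, caption_len) ground truth (teacher forcing)
        _, h = self.encoder(frame_feats)        # summarize the video
        dec_in = self.embed(caption_tokens)     # embed previous words
        dec_out, _ = self.decoder(dec_in, h)    # condition on the video state
        return self.out(dec_out)                # per-step vocabulary logits


if __name__ == "__main__":
    model = VideoCaptioner()
    feats = torch.randn(2, 30, 2048)            # 30 frames of CNN features
    tokens = torch.randint(0, 10000, (2, 12))   # 12-word captions
    print(model(feats, tokens).shape)           # torch.Size([2, 12, 10000])
```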
Video description is one of the most challenging tasks in the combined domain of computer vision and natural language processing. Captions for various open- and constrained-domain videos have been generated in the recent past, but to the best of our knowledge, descriptions for driving dashcam videos have never been explored. To explore dashcam video description generation for autonomous driving, this study presents DeepRide: a large-scale dashcam driving video description dataset for location-aware dense video description generation. The human-described dataset comprises visual scenes and actions with diverse weather, people, objects, and geographical paradigms. It bridges the autonomous driving domain with video description by generating textual descriptions of the visual information seen by a dashcam. We describe 16,000 videos (40 seconds each) in English, employing 2,700 man-hours by two highly qualified teams with domain knowledge. The descriptions consist of eight to ten sentences covering each dashcam video's global and event features in 60 to 90 words. The dataset contains more than 130K sentences, totaling approximately one million words. We evaluate the dataset by employing a location-aware vision-language recurrent transformer framework to elaborate on the efficacy and significance of vision-language research for autonomous vehicles. We provide baseline results on the dataset using three existing state-of-the-art recurrent models. The memory-augmented transformer performed best, owing to its highly summarized memory state for the visual information and sentence history while generating the trip description. Our proposed dataset opens a new dimension of diverse and exciting applications, such as self-driving vehicle reporting, driver and vehicle safety, inter-vehicle road intelligence sharing, and travel occurrence reports.

INDEX TERMS Dashcam video description, video captioning, autonomous trip description
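For illustration only, the following hypothetical sketch shows how a single DeepRide-style annotation record might be represented in code, based solely on the statistics quoted above (40-second clips, eight to ten sentences of 60 to 90 words, location awareness); the field names and structure are assumptions, not the released schema.

```python
# Hypothetical representation of one dashcam-video description record.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DashcamDescription:
    video_id: str                  # identifier of the 40-second dashcam clip
    location: str                  # coarse geographic/weather context (assumed field)
    sentences: List[str] = field(default_factory=list)  # 8-10 description sentences

    def word_count(self) -> int:
        # Full descriptions in the dataset span roughly 60 to 90 words.
        return sum(len(s.split()) for s in self.sentences)


record = DashcamDescription(
    video_id="clip_000123",
    location="urban intersection, light rain",
    sentences=[
        "The vehicle waits at a signal behind two cars.",
        "Pedestrians cross from left to right under umbrellas.",
    ],
)
print(record.word_count())
```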