This paper proposes a novel method for sports video scene classification, with the particular aim of video summarization. A shorter version of a video is often more appealing than the full version because it offers instant entertainment, yet producing such summaries manually is tedious, requiring significant labor hours and unnecessary machine occupation. Owing to the growing demand for video summarization in marketing, advertising, awareness videos, documentaries, and other interest groups, researchers continue to propose automation frameworks and novel schemes. Since scene classification is a fundamental component of video summarization and video analysis, its quality is particularly important. This article examines practical implementation gaps in existing techniques and presents a method that achieves high-quality scene classification. We consider cricket as a case study and classify five scene categories, i.e., batting, bowling, boundary, crowd, and close-up. Our model builds on a pre-trained AlexNet Convolutional Neural Network (CNN) and introduces new fully connected layers arranged in an encoder fashion. With data augmentation, we achieve an accuracy of 99.26% on a relatively small dataset. We evaluate performance on cricket videos and compare against baseline approaches as well as state-of-the-art deep-learning models, i.e., Inception V3, Visual Geometry Group networks (VGGNet16, VGGNet19), Residual Network (ResNet50), and AlexNet. Our experiments demonstrate that our AlexNet-based method produces better results than existing proposals.
The rapid expansion of deep learning has produced a variety of proposals and concerns in the area of video description, particularly in the recent past. Video description, the task of automatically localizing events and generating textual alternatives for the complex and diverse visual data in a video, bridges the two leading realms of computer vision and natural language processing. Several sequence-to-sequence algorithms have been proposed that split the task into two stages: encoding, i.e., learning insights from the visual representations, and decoding, i.e., transforming the learned representations into a sequence of words, one at a time. Deep learning approaches have gained considerable recognition owing to their superior computing capabilities and strong performance. However, the success of these algorithms depends strongly on the nature, diversity, and amount of data they are trained, validated, and tested on. Techniques applied to insufficient or inadequate train/test data cannot deliver promising conclusions, which in turn makes it difficult to evaluate the quality of the generated results. This survey focuses explicitly on the benchmark datasets and evaluation metrics developed and deployed for video description tasks, along with their capabilities and limitations. Finally, we conclude with the essential enhancements needed and encouraging research directions on the topic.
Solving parametric partial differential equations with artificial intelligence is gaining pace, primarily because conventional numerical solvers are computationally expensive and require considerable time to converge to a solution. Physics-informed deep learning, as an alternative, learns functional spaces directly and provides approximations considerably faster than conventional numerical solvers. Among various approaches, the Fourier transform approach learns the generalized functional space directly using deep learning. This work proposes a novel deep Fourier neural network that uses the Fourier neural operator as a fundamental building block and employs spectral feature aggregation to extract maximum performance. The proposed model offers superior accuracy and lower relative loss. We consider one- and two-dimensional time-independent equations as well as a two-dimensional time-dependent equation, evaluated on three benchmark datasets: Burgers' equation (one spatial dimension), the Darcy flow equation (two spatial dimensions), and the Navier-Stokes equations (two spatial dimensions plus one temporal dimension). We further employ a fluid-structure interaction case study from the machine component design process, with a computational fluid dynamics simulation dataset generated using the ANSYS-CFX software system, to evaluate regression of the temporal behavior of the fluid. Our method achieves superior performance on all four datasets and improves on the baseline, reducing the relative error on Burgers' equation by approximately 30%, on the Darcy flow equation by approximately 35%, and on the Navier-Stokes equations by approximately 20%.
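The Fourier neural operator building block mentioned above performs its learned transformation in the frequency domain: transform the input to Fourier space, linearly mix only the lowest modes with learned complex weights, discard the rest, and transform back. A minimal one-dimensional NumPy sketch of that spectral convolution, assuming identity weights for illustration (a real model would learn `weights` per layer and channel):

```python
import numpy as np

def spectral_conv1d(x, weights, n_modes):
    """One Fourier-layer pass: go to the frequency domain, reweight
    the lowest n_modes Fourier coefficients, zero the rest, and
    transform back to physical space.

    x       : (batch, n_points) real signal on a uniform grid
    weights : (n_modes,) complex multipliers (learned in a real model)
    """
    x_hat = np.fft.rfft(x, axis=-1)                       # to frequency domain
    out_hat = np.zeros_like(x_hat)
    out_hat[:, :n_modes] = x_hat[:, :n_modes] * weights   # mix low modes only
    return np.fft.irfft(out_hat, n=x.shape[-1], axis=-1)  # back to space

# With identity weights the layer acts as a low-pass projection.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64))
y = spectral_conv1d(x, np.ones(16, dtype=complex), n_modes=16)
print(y.shape)  # (2, 64)
```

Because the mixing happens on Fourier coefficients rather than grid points, the learned operator is resolution-independent, which is what lets such models generalize across discretizations of the same PDE.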
Recent research on solving parametric partial differential equations has shifted the focus of conventional neural networks from finite-dimensional Euclidean spaces to generalized functional spaces. Neural operators learn the generalized function mapping directly, a task accomplished primarily with numerical solvers for decades. However, numerical solvers are computationally expensive and require enormous time to solve partial differential equations. In this work, we propose a spatio-spectral neural operator that combines spectral feature learning with spatial feature learning. We formulate a novel neural network architecture that delivers state-of-the-art reproduction accuracy and a much reduced relative error on partial differential equation solutions. Fluid-structure interaction is a primary concern when designing a machine component, and numerical simulation of fluid flow is a time-intensive task, which has attracted machine learning researchers to seek more efficient solutions. Computational fluid dynamics has made noticeable progress and produced state-of-the-art numerical simulations over the last few decades. We propose a deep learning approach that employs the novel neural operator for computational fluid dynamics. We perform experiments on one- and two-dimensional simulations using the Burgers, Darcy flow, and Navier-Stokes equations as benchmarks. In addition, we consider a case study demonstrating prediction of transient fluid flow past an immersed body. Our solution achieves accuracy superior to the current level of research on learning-based solvers and Fourier neural operators, attaining the lowest relative error on the Burgers, Darcy flow, and Navier-Stokes equations, and a superior relative mean squared error on the case-study dataset.
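The core architectural idea stated in this abstract is combining a global spectral branch with a local spatial branch. The abstract gives no layer details, so the following NumPy sketch is an assumption about one plausible form: sum a Fourier-domain mixing branch and a small physical-space convolution branch, then apply a nonlinearity. Function names, the `tanh` choice, and the averaging kernel are all illustrative:

```python
import numpy as np

def spectral_branch(x, w_hat, n_modes):
    # Global mixing in the Fourier domain: keep and reweight low modes.
    x_hat = np.fft.rfft(x, axis=-1)
    out_hat = np.zeros_like(x_hat)
    out_hat[:, :n_modes] = x_hat[:, :n_modes] * w_hat
    return np.fft.irfft(out_hat, n=x.shape[-1], axis=-1)

def spatial_branch(x, kernel):
    # Local mixing in physical space: a small same-size convolution
    # with periodic (wrap) padding, matching the FFT's periodicity.
    pad = len(kernel) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="wrap")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in xp])

def spatio_spectral_block(x, w_hat, kernel, n_modes):
    # Sum of global (spectral) and local (spatial) feature maps,
    # followed by a pointwise nonlinearity.
    return np.tanh(spectral_branch(x, w_hat, n_modes)
                   + spatial_branch(x, kernel))

x = np.random.default_rng(1).standard_normal((2, 64))
y = spatio_spectral_block(x, np.ones(12, dtype=complex),
                          np.array([0.25, 0.5, 0.25]), n_modes=12)
print(y.shape)  # (2, 64)
```

The design intuition: the spectral branch captures smooth, long-range structure of the solution field, while the spatial branch recovers sharp local features (such as those near an immersed body) that truncated Fourier modes miss.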
Video description refers to understanding visual content and transforming that understanding into automatic textual narration. It bridges the key AI fields of computer vision and natural language processing and has real-time, practical applications. Deep learning-based approaches to video description have demonstrated enhanced results compared to conventional approaches. The current literature, however, lacks a thorough interpretation of the recently developed sequence-to-sequence techniques for video description. This paper fills that gap by focusing mainly on deep learning-enabled approaches to automatic caption generation. Sequence-to-sequence models follow an Encoder–Decoder architecture, employing some composition of a CNN, an RNN, or the LSTM or GRU variants as the encoder and decoder blocks. This standard architecture can be fused with an attention mechanism that focuses on specific salient features to achieve high-quality results. Reinforcement learning employed within the Encoder–Decoder structure can progressively deliver state-of-the-art captions by following exploration and exploitation strategies. The transformer is a modern and efficient transductive architecture for robust output: free from recurrence and based solely on self-attention, it allows parallelization and training on massive amounts of data, fully utilizing the available GPUs for most NLP tasks. With the recent emergence of several transformer variants, long-term dependency handling is no longer an issue for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes; they can find promising directions in this research.
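The attention mechanism fused into the Encoder–Decoder captioners above can be illustrated with the basic dot-product form: at each decoding step, score every encoded frame feature against the current decoder state, normalize the scores into a distribution, and feed the weighted context vector to the word predictor. A minimal NumPy sketch under those assumptions (real captioners use learned projections and multi-head variants):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Dot-product attention for one decoding step.

    decoder_state  : (d,) current hidden state of the decoder
    encoder_states : (n_frames, d) per-frame features from the encoder
    Returns the context vector and the attention weights.
    """
    scores = encoder_states @ decoder_state   # (n_frames,) relevance scores
    weights = softmax(scores)                 # distribution over frames
    context = weights @ encoder_states        # (d,) weighted frame summary
    return context, weights

# 10 video-frame features of dimension 8, one decoder hidden state.
rng = np.random.default_rng(2)
enc = rng.standard_normal((10, 8))
dec = rng.standard_normal(8)
context, weights = attend(dec, enc)
print(weights.sum())  # sums to 1 (up to float error)
```

Each generated word thus attends to a different subset of frames, which is the "specific salient features" behavior the survey credits for higher-quality captions.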