Automated recognition of human activities or actions has great significance because it has wide-ranging applications, including surveillance, robotics, and personal health monitoring. Over the past few years, many computer vision-based methods have been developed for recognizing human actions from RGB and depth camera videos. These methods include space-time trajectories, motion encoding, key pose extraction, space-time occupancy patterns, depth motion maps, and skeleton joints. However, these camera-based approaches are affected by background clutter and illumination changes and are limited to the camera's field of view. Wearable inertial sensors provide a viable solution to these challenges but suffer from their own limitations, such as sensitivity to location and orientation. Because camera and inertial sensor data are complementary, the use of multiple sensing modalities for accurate recognition of human actions is steadily increasing. This paper presents a viable multimodal feature-level fusion approach for robust human action recognition, which utilizes data from multiple sensors, including an RGB camera, a depth sensor, and wearable inertial sensors. Computationally efficient features are extracted from the RGB-D video camera and the inertial body sensors: densely extracted histogram of oriented gradients (HOG) features from the RGB/depth videos and statistical signal attributes from the wearable sensor data. The proposed human action recognition (HAR) framework is tested on the publicly available multimodal human action dataset UTD-MHAD, which consists of 27 different human actions. K-nearest neighbor and support vector machine classifiers are used for training and testing the proposed fusion model. The experimental results indicate that the proposed scheme achieves better recognition results than the state of the art. The feature-level fusion of RGB and inertial sensors provides the overall best performance for the proposed system, with an accuracy of 97.6%.
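As an illustration of the feature-level fusion described above, the sketch below concatenates per-frame HOG descriptors with simple statistical attributes of an inertial window and feeds the fused vector to SVM or KNN classifiers. It is a minimal sketch only: the helper names, HOG parameters, grayscale-frame assumption, and choice of statistics are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of feature-level fusion of video HOG features and inertial statistics.
# All parameter values and helper names below are illustrative assumptions.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def hog_video_features(frames, pixels_per_cell=(16, 16)):
    """Concatenate dense HOG descriptors over grayscale RGB or depth frames."""
    return np.concatenate([
        hog(f, orientations=9, pixels_per_cell=pixels_per_cell, cells_per_block=(2, 2))
        for f in frames
    ])

def inertial_statistics(window):
    """Per-axis statistical attributes of an (n_samples, n_axes) inertial window."""
    return np.concatenate([
        window.mean(axis=0), window.std(axis=0),
        window.min(axis=0), window.max(axis=0),
    ])

def fused_descriptor(frames, imu_window):
    """Feature-level fusion: concatenate the visual and inertial feature vectors."""
    return np.concatenate([hog_video_features(frames), inertial_statistics(imu_window)])

# Hypothetical usage: 'samples' pairs subsampled grayscale frames with an IMU window.
# X = np.stack([fused_descriptor(frames, imu) for frames, imu in samples])
# svm = SVC(kernel='linear').fit(X, y)
# knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
```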
One of the major requirements of content-based image retrieval (CBIR) systems is to return meaningful results for query images. The performance of these systems is severely degraded when image content that does not belong to the objects of interest is included during the image representation phase. Image segmentation is often considered a solution, but no existing technique can guarantee robust object extraction; moreover, most segmentation techniques are slow and their results are unreliable. To overcome these problems, this paper presents a bandelet transform-based image representation technique that reliably captures information about the major objects in an image. Artificial neural networks (ANNs) are applied for retrieval, and the performance of the system is evaluated on three standard datasets used in the CBIR domain.
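The sketch below illustrates the overall representation-plus-ANN retrieval pipeline in broad strokes. Since no bandelet transform is available in common Python libraries, a plain wavelet decomposition (PyWavelets) stands in for the bandelet-based descriptor; the sub-band energy features, the MLP classifier, and the class-filtered ranking are assumptions for illustration only, not the paper's method.

```python
# Illustrative CBIR pipeline: descriptor extraction, ANN classification, ranked retrieval.
# A wavelet decomposition is used as a stand-in for the paper's bandelet representation.
import numpy as np
import pywt
from sklearn.neural_network import MLPClassifier

def wavelet_descriptor(gray_image, wavelet='db2', level=2):
    """Mean absolute energy of each sub-band (stand-in for bandelet-based features)."""
    coeffs = pywt.wavedec2(gray_image, wavelet=wavelet, level=level)
    feats = [np.mean(np.abs(coeffs[0]))]          # approximation sub-band
    for detail in coeffs[1:]:                     # (horizontal, vertical, diagonal) per level
        feats.extend(np.mean(np.abs(band)) for band in detail)
    return np.array(feats)

def retrieve(query_feat, ann, gallery_feats, gallery_ids, top_k=10):
    """Rank gallery images of the predicted semantic class by feature distance."""
    predicted_class = ann.predict(query_feat[None, :])[0]
    mask = ann.predict(gallery_feats) == predicted_class
    dists = np.linalg.norm(gallery_feats[mask] - query_feat, axis=1)
    return gallery_ids[mask][np.argsort(dists)[:top_k]]

# Hypothetical usage with numpy arrays train_feats, train_labels, gallery_feats, gallery_ids:
# ann = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(train_feats, train_labels)
# top_matches = retrieve(wavelet_descriptor(query_image), ann, gallery_feats, gallery_ids)
```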
Broadcasters produce enormous volumes of sports video because of massive viewership and commercial benefits. Manually processing such content to select the important game segments is laborious; therefore, automatic video content analysis techniques are required to handle huge sports video repositories effectively. Sports video content analysis treats shot classification as a fundamental step that improves accuracy and suppresses misclassification in downstream tasks such as video summarization and key-event selection. In this work, we propose an effective shot classification method for field sports videos based on the AlexNet convolutional neural network (AlexNet CNN). The proposed method uses an eight-layer network consisting of five convolutional layers and three fully connected layers to classify shots as long, medium, close-up, or out-of-the-field. Response normalization and dropout layers applied to the feature maps boost the overall training and validation performance, evaluated on a diverse dataset of cricket and soccer videos. Compared with Support Vector Machine (SVM), Extreme Learning Machine (ELM), K-Nearest Neighbors (KNN), and a standard convolutional neural network (CNN), our model achieves the highest accuracy of 94.07%. A performance comparison against state-of-the-art baseline shot classification approaches is also conducted to demonstrate the superiority of the proposed approach.
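The sketch below shows an AlexNet-style network with five convolutional and three fully connected layers, local response normalization, dropout, and four shot classes, matching the layer counts described above. The kernel sizes, channel widths, and 227x227 input resolution follow the standard AlexNet recipe and are assumptions rather than the paper's exact settings.

```python
# AlexNet-style shot classifier sketch for four shot classes
# (long, medium, close-up, out-of-the-field); hyperparameters follow standard AlexNet.
import torch
import torch.nn as nn

class ShotClassifier(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # Five convolutional layers with response normalization and max pooling.
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        # Three fully connected layers with dropout for regularization.
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                      # x: (N, 3, 227, 227) frame batch
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# logits = ShotClassifier()(torch.randn(1, 3, 227, 227))  # -> shape (1, 4)
```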