Advanced aerial images have led to the development of improved human–object interaction recognition (HOI) methods for usage in surveillance, security, and public monitoring systems. Despite the ever-increasing rate of research being conducted in the field of HOI, the existing challenges of occlusion, scale variation, fast motion, and illumination variation continue to attract more researchers. In particular, accurate identification of human body parts, the involved objects, and robust features is the key to effective HOI recognition systems. However, identifying different human body parts and extracting their features is a tedious and rather ineffective task. Based on the assumption that only a few body parts are usually involved in a particular interaction, this article proposes a novel parts-based model for recognizing complex human–object interactions in videos and images captured using ground and aerial cameras. Gamma correction and non-local means denoising techniques have been used for pre-processing the video frames and Felzenszwalb’s algorithm has been utilized for image segmentation. After segmentation, twelve human body parts have been detected and five of them have been shortlisted based on their involvement in the interactions. Four kinds of features have been extracted and concatenated into a large feature vector, which has been optimized using the t-distributed stochastic neighbor embedding (t-SNE) technique. Finally, the interactions have been classified using a fully convolutional network (FCN). The proposed system has been validated on the ground and aerial videos of the VIRAT Video, YouTube Aerial, and SYSU 3D HOI datasets, achieving average accuracies of 82.55%, 86.63%, and 91.68% on these datasets, respectively.
The critical task of recognizing human–object interactions (HOI) finds its application in the domains of surveillance, security, healthcare, assisted living, rehabilitation, sports, and online learning. This has led to the development of various HOI recognition systems in the recent past. Thus, the purpose of this study is to develop a novel graph-based solution for this purpose. In particular, the proposed system takes sequential data as input and recognizes the HOI interaction being performed in it. That is, first of all, the system pre-processes the input data by adjusting the contrast and smoothing the incoming image frames. Then, it locates the human and object through image segmentation. Based on this, 12 key body parts are identified from the extracted human silhouette through a graph-based image skeletonization technique called image foresting transform (IFT). Then, three types of features are extracted: full-body feature, point-based features, and scene features. The next step involves optimizing the different features using isometric mapping (ISOMAP). Lastly, the optimized feature vector is fed to a graph convolution network (GCN) which performs the HOI classification. The performance of the proposed system was validated using three benchmark datasets, namely, Olympic Sports, MSR Daily Activity 3D, and D3D-HOI. The results showed that this model outperforms the existing state-of-the-art models by achieving a mean accuracy of 94.1% with the Olympic Sports, 93.2% with the MSR Daily Activity 3D, and 89.6% with the D3D-HOI datasets.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.