Complex Human–Object Interactions Analyzer Using a DCNN and SVM Hybrid Approach

Phyo, Cho Nilar; Zin, Thi Thi; Tin, Pyke

doi:10.3390/app9091869

Cited by 12 publications

(9 citation statements)

References 13 publications


“…Many HOI recognition systems have been proposed in recent years comprising of both deep learning [18,19,20] and machine learning based approaches [21]. However, in our proposed work, we have developed a machine learning based multi-vision sensors system that incorporates a semantic segmentation technique.…”

Section: Related Work (mentioning)

confidence: 99%

Semantic Recognition of Human-Object Interactions via Gaussian-Based Elliptical Modeling and Pixel-Level Labeling

et al. 2021


Human-Object Interaction (HOI) recognition, due to its significance in many computer vision-based applications, requires in-depth and meaningful details from image sequences. Incorporating semantics in scene understanding has led to a deep understanding of human-centric actions. Therefore, in this research work, we propose a semantic HOI recognition system based on multi-vision sensors. In the proposed system, the RGB and depth images, de-noised via Bilateral Filtering (BLF), are segmented into multiple clusters using a Simple Linear Iterative Clustering (SLIC) algorithm. The skeleton is then extracted from the segmented RGB and depth images via Euclidean Distance Transform (EDT). Human joints, extracted from the skeleton, provide the annotations for accurate pixel-level labeling. An elliptical human model is then generated via a Gaussian Mixture Model (GMM). A Conditional Random Field (CRF) model is trained to allocate a specific label to each pixel of different human body parts and an interaction object. Two semantic feature types are extracted from each labeled human body part and labeled object: fiducial points and 3D point cloud. Feature descriptors are quantized using Fisher's Linear Discriminant Analysis (FLDA) and classified using K-ary Tree Hashing (KATH). In the experimentation phase, the recognition accuracy achieved is 92.88% with the Sports dataset, 93.5% with the Sun Yat-Sen University (SYSU) 3D HOI dataset, and 94.16% with the Nanyang Technological University (NTU) RGB+D dataset. The proposed system is validated via extensive experimentation and should be applicable to many computer vision-based applications such as healthcare monitoring, security systems, and assisted living.

INDEX TERMS: 3D point cloud, fiducial points, human-object interaction, pixel labeling, semantic segmentation, super-pixels, K-ary tree hashing.



“…The action recognition problem [1,2] can be solved using a video or a single image. However, video-based action recognition has a delay (required for receiving all video frames) and a large computational complexity, which makes it impractical for embedded devices with limited resources [3].…”

Section: Introduction (mentioning)

confidence: 99%

“…Based on the above analysis, we propose a body-part-aware and attention-based action recognition method using the pose information, as depicted in Figure 1. It consists of three streams: (1) image-based action recognition; (2) attention-based action recognition; and (3) body-part-based action recognition. Moreover, the information that describes the human action should be considered together, which leads us to multitask learning for human pose estimation and action recognition.…”

Section: Introduction (mentioning)

confidence: 99%

Body-Part-Aware and Multitask-Aware Single-Image-Based Action Recognition

2020


Action recognition is an application that, ideally, requires real-time results. We focus on single-image-based action recognition instead of video-based recognition because of its improved speed and lower computational cost. However, a single image contains limited information, which makes single-image-based action recognition a difficult problem. To get an accurate representation of action classes, we propose three feature-stream-based shallow sub-networks (image-based, attention-image-based, and part-image-based feature networks) on the deep pose estimation network in a multitasking manner. Moreover, we design a multitask-aware loss function, so that the proposed method can be adaptively trained with heterogeneous datasets where only human pose annotations or action labels are included (instead of both pose and action information), which makes it easier to apply the proposed approach to new data for behavioral analysis in intelligent systems. In our extensive experiments, we showed that these streams represent complementary information and, hence, the fused representation is robust in distinguishing diverse fine-grained action classes. Unlike other methods, the human pose information was trained using heterogeneous datasets in a multitasking manner; nevertheless, it achieved 91.91% mean average precision on the Stanford 40 Actions Dataset. Moreover, we demonstrated that the proposed method can be flexibly applied to the multi-label action recognition problem on the V-COCO Dataset.
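The abstract above does not give the form of its multitask-aware loss. As a rough illustration of the general idea only (training on samples that carry pose annotations, action labels, or both), the following is a minimal sketch of a masked multitask loss. The function name `masked_multitask_loss`, the dictionary keys, and the weights `w_pose` and `w_action` are all hypothetical and are not taken from the cited paper.

```python
def masked_multitask_loss(samples, w_pose=1.0, w_action=1.0):
    """Combine per-sample task losses, skipping tasks whose label is missing.

    `samples` is a list of dicts with optional float entries "pose_loss"
    and "action_loss"; a value of None (or a missing key) means the sample
    has no annotation for that task. This is an assumed formulation, not
    the loss defined in the cited paper.
    """
    # Keep only the losses for which an annotation actually exists.
    pose = [s["pose_loss"] for s in samples if s.get("pose_loss") is not None]
    action = [s["action_loss"] for s in samples if s.get("action_loss") is not None]

    total = 0.0
    # Average each task over its own labeled samples, so that a task with
    # few labels in the batch is not diluted by unlabeled samples.
    if pose:
        total += w_pose * sum(pose) / len(pose)
    if action:
        total += w_action * sum(action) / len(action)
    return total


batch = [
    {"pose_loss": 2.0, "action_loss": None},   # pose-only dataset
    {"pose_loss": None, "action_loss": 4.0},   # action-only dataset
    {"pose_loss": 4.0, "action_loss": 2.0},    # fully annotated sample
]
loss = masked_multitask_loss(batch)
```

Averaging each task over its own labeled samples is one possible design choice; a real training setup would apply the same masking to tensors inside a deep learning framework.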


“…In recent years, the trajectory-based method has achieved great success in the field of behavior recognition [4][5][6]. Unlike the method of directly extracting local features, the trajectory-based method extracts space-time trajectories by matching feature points between adjacent frames and then representing human behavior [7][8][9][10]. Yun [11] used scale-invariant feature transform (SIFT) to match and track spatiotemporal context information between adjacent frames, and Matikainen [12,13] used the Kanade-Lucas-Tomasi (KLT) optical flow method to track feature points between adjacent frames and extract trajectories.…”

Section: Introduction (mentioning)

confidence: 99%

Human Action Recognition Based on Foreground Trajectory and Motion Difference Descriptors

Dong, Daidi, et al. 2019

Applied Sciences


Aimed at the problems of high redundancy of trajectory and susceptibility to background interference in traditional dense trajectory behavior recognition methods, a human action recognition method based on foreground trajectory and motion difference descriptors is proposed. First, the motion magnitude of each frame is estimated by optical flow, and the foreground region is determined according to each motion magnitude of the pixels; the trajectories are only extracted from behavior-related foreground regions. Second, in order to better describe the relative temporal information between different actions, a motion difference descriptor is introduced to describe the foreground trajectory, and the direction histogram of the motion difference is constructed by calculating the direction information of the motion difference per unit time of the trajectory point. Finally, a Fisher vector (FV) is used to encode histogram features to obtain video-level action features, and a support vector machine (SVM) is utilized to classify the action category. Experimental results show that this method can better extract the action-related trajectory, and it can improve the recognition accuracy by 7% compared to the traditional dense trajectory method.
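To make the motion difference descriptor in the abstract above more concrete, here is a minimal sketch of a direction histogram built from the motion difference along a single trajectory. The bin count, the magnitude weighting, and the L1 normalization are assumptions made for illustration; the abstract specifies none of them, and the function name `motion_difference_histogram` is hypothetical.

```python
import math


def motion_difference_histogram(trajectory, num_bins=8):
    """Direction histogram of the motion difference along one trajectory.

    `trajectory` is a list of (x, y) points sampled at unit time steps.
    Assumed formulation: bins are weighted by motion-difference magnitude
    and the histogram is L1-normalized.
    """
    # Displacement per unit time between consecutive trajectory points.
    disp = [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]
    # Motion difference: change in displacement between consecutive steps.
    diffs = [(u1 - u0, v1 - v0)
             for (u0, v0), (u1, v1) in zip(disp, disp[1:])]

    hist = [0.0] * num_bins
    bin_width = 2 * math.pi / num_bins
    for du, dv in diffs:
        mag = math.hypot(du, dv)
        if mag == 0.0:
            continue  # no change in motion, so no direction to record
        angle = math.atan2(dv, du) % (2 * math.pi)  # map to [0, 2*pi)
        idx = min(int(angle / bin_width), num_bins - 1)
        hist[idx] += mag

    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# A point accelerating along +x: every motion difference points in direction 0.
accelerating = motion_difference_histogram([(0, 0), (1, 0), (3, 0), (6, 0)])
# Constant velocity: all motion differences are zero, so the histogram is empty.
constant = motion_difference_histogram([(0, 0), (1, 1), (2, 2), (3, 3)])
```

In the pipeline the abstract describes, per-trajectory histograms like this would then be encoded with a Fisher vector and classified with an SVM; those stages are not sketched here.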

