Visual attention mechanisms have been widely used in computer vision and play a vital role in research on human action recognition. In this paper, we explore a novel moving-target detection mechanism for human action recognition and propose a new 3D CNN (3D Convolutional Neural Network) model, dubbed LaM-2SRN (Local Features Enhanced and Moving target detected 2Stream-ResNet), for extracting and learning attention-enhanced spatiotemporal features. The contributions of this paper are as follows. First, a traditional CAM (Class Activation Map)-based visual attention algorithm is used to obtain the optical flow information of the human region, eliminating the influence of irrelevant optical flow information (such as background clutter). Second, the ViBe algorithm is used to identify the moving targets across consecutive frames and retain their optical flow information, which complements the optical flow information of the human region to yield a complete motion descriptor. After the motion information is marked in the video frames, the marked frames and the original frames are fed into 3D CNN and 2D CNN models, respectively, to acquire a mixed descriptor that serves as the basis for video classification. Unlike most previous target detection algorithms, our method detects not only the human target but also the moving foreground targets in the video, so that the pre-trained CNN models can obtain more complete motion information. We evaluate on two benchmarks, UCF101 and HMDB51, using RGB-only video data; the experimental results demonstrate that our LaM-2SRN is comparable to previous state-of-the-art algorithms.

INDEX TERMS Action recognition, video understanding, visual attention, background detection, CNN.
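The abstract describes fusing two masks before extracting optical flow: a CAM-derived human-region mask and a moving-foreground mask. The sketch below illustrates that fusion idea only; it is not the authors' implementation. Since ViBe is not bundled with stock OpenCV, MOG2 is used here as a stand-in foreground detector, and `video.avi` plus the constant CAM mask are hypothetical placeholders.

```python
# Minimal sketch of the mask-fusion idea: keep dense optical flow only where
# the human-region mask OR the moving-foreground mask is active.
import cv2
import numpy as np

def fuse_motion_masks(cam_mask: np.ndarray, fg_mask: np.ndarray) -> np.ndarray:
    """Union of the human-region mask and the moving-foreground mask (HxW, {0,1})."""
    return cam_mask.astype(np.uint8) | fg_mask.astype(np.uint8)

def masked_flow(prev_gray, next_gray, mask):
    """Dense Farneback optical flow, zeroed outside the fused motion mask."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow * mask[..., None]  # broadcast mask over the two flow channels

bg = cv2.createBackgroundSubtractorMOG2(detectShadows=False)  # stand-in for ViBe
cap = cv2.VideoCapture("video.avi")                           # placeholder clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    fg = (bg.apply(frame) > 0).astype(np.uint8)   # moving-target mask
    cam = np.ones_like(fg)                        # placeholder: CAM heatmap threshold
    flow = masked_flow(prev_gray, gray, fuse_motion_masks(cam, fg))
    prev_gray = gray
```

In practice the `cam` mask would come from thresholding a class activation heatmap rather than the all-ones placeholder used here; the union of the two masks is what lets background motion outside the human region survive the filtering, matching the "complete motion descriptor" goal stated above.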