Gate-Shift Networks for Video Action Recognition

Sudhakaran, Swathikiran; Escalera, Sérgio; Lanz, Oswald

doi:10.1109/cvpr42600.2020.00118

Cited by 182 publications

(109 citation statements)

References 47 publications

Mentioning: 109

“…Table V analyzes the generalization ability of the proposed method for the Something-V1 dataset. Note that the proposed method achieved SOTA performance of 52.08%, surpassing the latest methods [18,35] that require a vast amount of computation over 50 GFLOPs. In addition, as the number of frames used for learning decreased, the proposed method had a higher performance than GSN.…”

Section: Quantitative Results (mentioning)

confidence: 90%

“…As the training and evaluation datasets changed, a different backbone structure was used in this experiment. In detail, the InceptionV3 of GSN [35] was employed as a backbone, and the classifier was trained after attaching the DA module with the output terminal of InceptionV3 (see Fig. 2).…”

Section: Quantitative Results (mentioning)

confidence: 99%

“…Sudhakaran et al proposed a few gating methods to upgrade the spatio-temporal decomposition of 3D convolution kernel, and achieved the SOTA performance [35]. Our approach differs from the gate-shift network (GSN) of [35] in two respects: 1) Unlike GSN, which considers only the local view, the proposed method analyzes the image characteristics from a global perspective, similar to the nonlocal network [37]. 2) Through discriminative learning based on feature map similarity, the proposed method can emphasize the target object(s).…”

Section: A Video Action Recognition (mentioning)

confidence: 99%

“…In this paper, ResNext-101 [1] was used as the backbone for UCF-101 and HMDB-51 datasets, and InceptionV3 of GSN [35] was adopted as the backbone for Diving48 and Something-V1 datasets. To construct the same experimental environment as the previous techniques, pre-trained ResNext-101 with different datasets and InceptionV3 architectures were used in the fine-tuning process of action recognition.…”

Section: B Experimental Setup (mentioning)

confidence: 99%

Metric-Based Attention Feature Learning for Video Action Recognition

et al. 2021

Conventional approaches for video action recognition were designed to learn feature maps using 3D convolutional neural networks (CNNs). For better action recognition, they trained the large-scale video datasets with the representation power of 3D CNN. However, action recognition is still a challenging task. Since the previous methods rarely distinguish human body from environment, they often overfit background scenes. Note that separating human body from background allows to learn distinct representations of human action. This paper proposes a novel attention module aiming at only action part(s), while neglecting non-action part(s) such as background. First, the attention module employs triplet loss to differentiate active features from non-active or less active features. Second, two attention modules based on spatial and channel domains are proposed to enhance the feature representation ability for action recognition. The spatial attention module is to learn spatial correlation of features, and the channel attention module is to learn channel correlation. Experimental results show that the proposed method achieves state-of-the-art performance of 41.41% and 55.21% on Diving48 and Something-V1 datasets, respectively. In addition, the proposed method provides competitive performance even on UCF101 and HMDB-51 datasets, i.e., 95.83% on UCF-101 and 74.33% on HMDB-51.

“…Another alternative is to extract appearance features from the individual frames and perform a temporal pooling operation to encode their temporal evolution [10], [13], [68]. Recent approaches explore the feasibility of temporal modeling with 2D CNNs [34], [55], [65]. Another approach includes using two CNNs, each encoding an RGB image for appearance cues and stacks of optical flow for motion cues [11], [12], [51].…”

Section: Related Work (mentioning)

confidence: 99%

Learning to Recognize Actions on Objects in Egocentric Video With Attention Dictionaries

Sudhakaran, Escalera, Lanz. 2023. IEEE Trans. Pattern Anal. Mach. Intell.

Self Cite

We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling in EgoACO, we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a redesigned output gate. Action, object and context descriptors are fused by a multi-head prediction that accounts for the inter-dependencies between noun-verb-action structured labels in egocentric video datasets. EgoACO features built-in visual explanations, helping learning and interpretation. Results on the two largest egocentric action recognition datasets currently available, EPIC-KITCHENS and EGTEA, show that by explicitly decoding action-context-object descriptors, EgoACO achieves state-of-the-art recognition performance.

Action Recognition Using Local Visual Descriptors and Inertial Data

Alhersh, Belhaouari, Stuckenschmidt. 2019. Lecture Notes in Computer Science

Different body sensors and modalities can be used in human action recognition, either separately or simultaneously. Multi-modal data can be used in recognizing human action. In this work we are using inertial measurement units (IMUs) positioned at left and right hands with first person vision for human action recognition. A novel statistical feature extraction method was proposed based on curvature of the graph of a function and tracking left and right hand positions in space. Local visual descriptors have been used as features for egocentric vision. An intermediate fusion between IMUs and visual sensors has been performed. Despite of using only two IMUs sensors with egocentric vision, our classification result achieved is 99.61% for recognizing nine different actions. Feature extraction step could play a vital step in human action recognition with limited number of sensors, hence, our method might indeed be promising.
