2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00118

Gate-Shift Networks for Video Action Recognition

Abstract: Most action recognition methods are based on either a) late aggregation of frame-level CNN features using average pooling, max pooling, or an RNN, among others, or b) spatio-temporal aggregation via 3D convolutions. The first assumes independence among frame features up to a certain level of abstraction and then performs higher-level aggregation, while the second extracts spatio-temporal features from grouped frames as early fusion. In this paper we explore the space in between these two, by letting adjacent feature branches …
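As a rough illustration of the middle ground the abstract alludes to, below is a minimal PyTorch sketch of a gate-shift-style block: a learned spatial gate (tanh-activated) selects feature planes to shift forward or backward in time, while the ungated remainder passes through a residual path. This is an illustrative simplification under stated assumptions, not the authors' reference implementation, and all names are made up.

```python
import torch
import torch.nn as nn


class GateShiftSketch(nn.Module):
    """Illustrative gate-shift style block (not the official GSM code).

    Input: a (N, C, T, H, W) feature volume. A spatial tanh gate decides
    which features are routed through +/-1 temporal shifts; the ungated
    remainder passes through a residual path unchanged.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Group-wise spatial gating over each feature plane.
        self.gate = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                              padding=(0, 1, 1), groups=channels)

    @staticmethod
    def _shift(x: torch.Tensor, offset: int) -> torch.Tensor:
        """Shift along the temporal axis (dim 2), zero-padding the border."""
        out = torch.zeros_like(x)
        if offset > 0:
            out[:, :, offset:] = x[:, :, :-offset]
        else:
            out[:, :, :offset] = x[:, :, -offset:]
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.tanh(self.gate(x))        # gate values in [-1, 1]
        gated = g * x                       # features selected for shifting
        residual = x - gated                # pass-through path
        half = x.shape[1] // 2              # shift halves in opposite directions
        fwd = self._shift(gated[:, :half], +1)
        bwd = self._shift(gated[:, half:], -1)
        return residual + torch.cat([fwd, bwd], dim=1)


x = torch.randn(2, 64, 8, 14, 14)           # (batch, channels, time, H, W)
print(GateShiftSketch(64)(x).shape)         # torch.Size([2, 64, 8, 14, 14])
```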

Cited by 182 publications · 109 citation statements · References: 47 publications

Citation statements (ordered by relevance):
“…Table V analyzes the generalization ability of the proposed method on the Something-V1 dataset. Note that the proposed method achieved SOTA performance of 52.08%, surpassing the latest methods [18,35], which require over 50 GFLOPs of computation. In addition, as the number of frames used for learning decreased, the proposed method maintained higher performance than GSN.…”
Section: Quantitative Results (mentioning)
confidence: 90%
“…As the training and evaluation datasets changed, a different backbone structure was used in this experiment. In detail, the InceptionV3 of GSN [35] was employed as the backbone, and the classifier was trained after attaching the DA module to the output of InceptionV3 (see Fig. 2).…”
Section: Quantitative Results (mentioning)
confidence: 99%
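The wiring described in this statement, GSN's InceptionV3 backbone with a module attached at its output and a classifier trained on top, might look roughly like the sketch below. `DAModule` is a hypothetical placeholder (the DA module's internals are not specified here), and the 2048-d feature size follows torchvision's InceptionV3.

```python
import torch
import torch.nn as nn
from torchvision.models import inception_v3


class DAModule(nn.Module):
    """Stand-in for the citing paper's DA module (details not given here)."""

    def __init__(self, dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


# InceptionV3 backbone with its classification head removed, exposing
# the 2048-d pooled features at the output terminal.
backbone = inception_v3(weights=None, aux_logits=False)
backbone.fc = nn.Identity()
backbone.eval()

da = DAModule(2048)
classifier = nn.Linear(2048, 174)       # Something-V1 has 174 classes

frames = torch.randn(8, 3, 299, 299)    # 8 sampled frames of one clip
with torch.no_grad():
    feats = backbone(frames)            # (8, 2048) per-frame features
logits = classifier(da(feats).mean(dim=0, keepdim=True))  # (1, 174)
```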
“…Another alternative is to extract appearance features from the individual frames and perform a temporal pooling operation to encode their temporal evolution [10], [13], [68]. Recent approaches explore the feasibility of temporal modeling with 2D CNNs [34], [55], [65]. Another line of work uses two CNNs, one encoding RGB images for appearance cues and the other encoding stacks of optical flow for motion cues [11], [12], [51].…”
Section: Related Work (mentioning)
confidence: 99%
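The first family in this passage, per-frame appearance features followed by temporal pooling, is simple enough to sketch concretely. Below is a hedged late-aggregation baseline assuming average pooling over time and an arbitrarily chosen ResNet-18 frame encoder; it corresponds to scheme (a) of the abstract, not to any specific cited method.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class LatePoolingClassifier(nn.Module):
    """Late aggregation: independent per-frame features, then average pooling.

    Frames are treated as independent up to the feature level and are
    fused only at the end by a temporal mean.
    """

    def __init__(self, num_classes: int):
        super().__init__()
        encoder = resnet18(weights=None)
        encoder.fc = nn.Identity()           # 512-d per-frame features
        self.encoder = encoder
        self.head = nn.Linear(512, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (N, T, 3, H, W) -> fold time into the batch axis
        n, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1))    # (N*T, 512)
        feats = feats.view(n, t, -1).mean(dim=1)    # temporal average pool
        return self.head(feats)


model = LatePoolingClassifier(num_classes=174).eval()
logits = model(torch.randn(2, 8, 3, 224, 224))      # (2, 174)
```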