2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00817
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification

Abstract: Recently, substantial research effort has focused on how to apply CNNs or RNNs to better extract temporal patterns from videos, so as to improve the accuracy of video classification. In this paper, however, we show that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common video classification datasets. We investigate the potential of a purely attention based local feature integration. Accounting for the characteristics of such features in video cl…


Cited by 227 publications (125 citation statements).
References 32 publications.
“…There are numerous follow-up studies to improve the aforementioned two baselines. For example, TLE [7], ShuttleNet [35], AttentionClusters [29] and NetVLAD [1,16] are proposed for better local feature integration, instead of the direct average pooling used in TSN. OFF [37] and the motion feature network [26] are proposed to integrate motion-information modeling into a spatial CNN, instead of using two streams.…”
Section: Action Recognition
confidence: 99%
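The excerpt above contrasts attention-based local feature integration with the plain average pooling used in TSN. A minimal sketch of that idea (not the authors' implementation; the feature matrix and the attention parameters `w` are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 16                      # number of frames, feature dimension
X = rng.standard_normal((T, D))   # local (frame-level) features
w = rng.standard_normal(D)        # attention parameters (assumed learnable)

def avg_pool(X):
    """Baseline: uniform average over frames (as in TSN)."""
    return X.mean(axis=0)

def attention_pool(X, w):
    """Attention-weighted sum: softmax scores over frames."""
    scores = X @ w                     # one scalar score per frame, shape (T,)
    a = np.exp(scores - scores.max())  # numerically stable softmax
    a /= a.sum()
    return a @ X                       # weighted combination, shape (D,)

v_avg = avg_pool(X)
v_att = attention_pool(X, w)
assert v_avg.shape == v_att.shape == (D,)
```

Note that with zero attention parameters the softmax weights become uniform, so attention pooling degenerates to average pooling; the learned weights are what let the model emphasize informative frames.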
“…Existing research efforts [45,36,12,10,13,7,11,41,3,52,33,47,8,1,16,29,35,42,44,56,5,20] mainly focus on building effective and efficient video modeling networks. Generally speaking, they can be divided into two directions: (1) two-stage solutions [8,1,16,29,35], which extract spatial feature vectors from video frames and then integrate the obtained local feature sequence into one compact video descriptor for recognition; (2) 2D [45,36,10,7] or 3D convolution-based [41,11,3,52,33,47,42,44,56,5] end-to-end video classification methods. Though great progress has been achieved by these methods, limited attention is paid to the aforementioned variation of frame-level salience among different frames.…”
Section: Introduction
confidence: 99%
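The two-stage direction described above can be sketched end to end: stage 1 produces frame-level feature vectors (random stand-ins here for CNN outputs), and stage 2 integrates them with several independent attention units whose outputs are concatenated into one compact video descriptor, loosely in the spirit of attention clusters. All names and sizes are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def integrate(X, W):
    """Stage 2: concatenate K attention-pooled descriptors.

    X: (T, D) frame-level features; W: (K, D) attention parameters,
    one row per attention unit. Returns a (K*D,) video descriptor.
    """
    return np.concatenate([softmax(X @ w) @ X for w in W])

rng = np.random.default_rng(1)
T, D, K = 10, 32, 4
frames = rng.standard_normal((T, D))   # stage 1: per-frame CNN features (simulated)
W = rng.standard_normal((K, D))        # K attention units (assumed learnable)
video_descriptor = integrate(frames, W)
assert video_descriptor.shape == (K * D,)
```

Using several attention units lets each one specialize in a different subset of frames, which is one way a purely attention-based aggregator can compensate for the uneven salience of frames that the excerpt mentions.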
“…Although several works that combine audio-visual sources have been reported in the context of first-person action recognition challenges [15,20,21], they provide few details about their models. Attention mechanisms for action recognition that use audio as a modality branch are proposed in [30,31]. However, the use of audio-visual cues for object interaction recognition is still very limited, and previous works only reported results on the full interaction (action) and not its components (verb and noun).…”
Section: Related Work
confidence: 99%
“…As far as we know, unlike similar tasks such as action recognition, there has been relatively little work that explores spatial attention for emotion recognition. Zhang et al [10] proposed an attention mechanism based on a fully convolutional neural network for audio emotion recognition, which helped the model focus on the emotion-relevant regions of the speech spectrogram.…”
Section: Related Work
confidence: 99%