Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification

Long, Xiang; Gan, Chuang; Melo, Gerard de; Wu, Jiajun; Liu, Xiao; Wen, Shifeng

doi:10.1109/cvpr.2018.00817

Cited by 227 publications

(125 citation statements)

References 32 publications

Supporting

Mentioning

123

Contrasting

Unclassified


“…There are numerous followup studies to improve the aforementioned two baselines. For example, TLE [7], ShuttleNet [35], AttentionClusters [29] and NetVlad [1,16] are proposed for better local feature integration instead of directly AVG-Pooling as used in TSN. OFF [37] and motion feature network [26] are proposed for integrating motion information modeling into a spatial CNN network, instead of using two streams.…”

Section: Action Recognition (mentioning)

confidence: 99%

“…Existing research efforts [45,36,12,10,13,7,11,41,3,52,33,47,8,1,16,29,35,42,44,56,5,20] mainly focus on building effective and efficient video modeling networks. Generally speaking, they can be divided into two directions, namely, (1) two-stage solutions [8,1,16,29,35] which extract spatial feature vectors from video frames and then integrate the obtained local feature sequence into one compact video descriptor for recognition; (2) the 2D [45,36,10,7] or 3D convolution based [41,11,3,52,33,47,42,44,56,5] end-to-end video classification methods. Though great progress has been achieved by these methods, limited attention is paid to the aforementioned variation of frame-level salience among different frames.…”

Section: Introduction (mentioning)

confidence: 99%


Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition

Tan et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

141

126


Video recognition has drawn great research interest and great progress has been made. A suitable frame sampling strategy can improve the accuracy and efficiency of recognition. However, mainstream solutions generally adopt handcrafted frame sampling strategies for recognition. This can degrade performance, especially in untrimmed videos, due to the variation of frame-level saliency. To this end, we concentrate on improving untrimmed video classification by developing a learning-based frame sampling strategy. We intuitively formulate the frame sampling procedure as multiple parallel Markov decision processes, each of which aims at picking out a frame/clip by gradually adjusting an initial sampling. Then we propose to solve the problems with multi-agent reinforcement learning (MARL). Our MARL framework is composed of a novel RNN-based context-aware observation network which jointly models context information among nearby agents and historical states of a specific agent, a policy network which generates the probability distribution over a predefined action space at each step, and a classification network for reward calculation as well as final recognition. Extensive experimental results show that our MARL-based scheme remarkably outperforms hand-crafted strategies with various 2D and 3D baseline methods. Our single RGB model achieves performance comparable to the ActivityNet v1.3 champion submission with multi-modal multi-model fusion, and new state-of-the-art results on YouTube Birds and YouTube Cars.


“…Although several works that combine audiovisual sources have been reported in the context first-person action recognition challenges [15,20,21], they provide few details about their models. In [30,31] are proposed attention mechanisms for action recognition using audio as a modality branch. However, the use of audio-visual cues for object interaction recognition is still very limited and previous works only reported results on the full interaction (action) and not its components (verb and noun).…”

Section: Related Work (mentioning)

confidence: 99%

Seeing and Hearing Egocentric Actions: How Much Can We Learn?

Cartas, Luque, Radeva, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)


Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed focusing on a single modality. In particular, a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
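The late fusion mentioned in this abstract can be illustrated with a minimal sketch. It assumes each stream (audio, spatial, temporal) has already produced class probabilities and that fusion is a weighted average of them; the function name `late_fusion`, the stream names, and the equal default weights are hypothetical, and the abstract does not state the fusion rule the authors actually use.

```python
def late_fusion(stream_probs, weights=None):
    """Fuse per-stream class probabilities by a weighted average.

    stream_probs: dict mapping a stream name to a list of class probabilities.
    weights: optional dict mapping a stream name to a non-negative weight
             (assumption: equal weights when omitted).
    """
    names = list(stream_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    num_classes = len(stream_probs[names[0]])
    # Average each class probability across streams, weighted per stream.
    return [
        sum(weights[name] * stream_probs[name][c] for name in names) / total_weight
        for c in range(num_classes)
    ]


# Hypothetical two-class scores from three streams.
fused = late_fusion({
    "audio": [0.2, 0.8],
    "spatial": [0.6, 0.4],
    "temporal": [0.7, 0.3],
})
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the streams are combined only at the score level, each one can be trained and evaluated independently before fusion.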


“…As far as we know, unlike similar task such as action recognition, there has been relatively little work that explores spatial-attention for emotion recognition. Zhang et al [10] proposed attention based on fully convolutional neural network for audio emotion recognition which helped the model to focus on the emotion-relevant regions in speech spectrogram.…”

Section: Related Work (mentioning)

confidence: 99%

Emotion Recognition with Spatial Attention and Temporal Softmax Pooling

Aminbeidokhti, Pedersoli, Cardinal, et al. 2019

Lecture Notes in Computer Science


Video-based emotion recognition is a challenging task because it requires distinguishing the small deformations of the human face that represent emotions, while being invariant to stronger visual differences due to different identities. State-of-the-art methods normally use complex deep learning models such as recurrent neural networks (RNNs, LSTMs, GRUs), convolutional neural networks (CNNs, C3D, residual networks) and their combination. In this paper, we propose a simpler approach that combines a CNN pre-trained on a public dataset of facial images with (1) a spatial attention mechanism, to localize the most important regions of the face for a given emotion, and (2) temporal softmax pooling, to select the most important frames of the given video. Results on the challenging EmotiW dataset show that this approach can achieve higher accuracy than more complex approaches.
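One way to read the temporal softmax pooling described in this abstract is sketched below. It assumes that, for each class, a softmax over the per-frame scores gives frame weights and the video-level score is the weighted sum of those scores; the function names and the `temperature` parameter are hypothetical, and the paper's exact formulation may differ.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_softmax_pool(frame_scores, temperature=1.0):
    """Pool T frames of C class scores into one video-level score per class.

    frame_scores: list of T lists, each holding C class scores.
    Frames with higher scores for a class get larger weights for that class,
    so the result lies between average pooling and max pooling.
    """
    num_classes = len(frame_scores[0])
    pooled = []
    for c in range(num_classes):
        per_frame = [frame[c] for frame in frame_scores]
        weights = softmax(per_frame, temperature)
        pooled.append(sum(w * s for w, s in zip(weights, per_frame)))
    return pooled


# Hypothetical scores for two frames and two classes.
video_scores = temporal_softmax_pool([[0.0, 1.0], [2.0, 1.0]])
```

Under this assumption, a large `temperature` moves the result toward average pooling and a small one toward max pooling over frames.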
