Spatiotemporal Pyramid Network for Video Action Recognition

Wang, Yunbo; Long, Mingsheng; Wang, Jianmin; Yu, Philip S.

doi:10.1109/cvpr.2017.226

Cited by 239 publications

(128 citation statements)

References 33 publications

Supporting

Mentioning: 128

Contrasting


“…Recently, researchers have shown increasing interests in exploring structured layers to enhance representation capability of networks [12,25,1,22]. One particular kind of structured layer is concerned with global covariance pooling after the last convolution layer, which has shown impressive improvement over the classical first-order pooling, successfully used in FGVC [25], visual question answering [15] and video action recognition [34]. Very recent works have demonstrated that matrix square root normalization of global covariance pooling plays a key role in achieving state-of-the-art performance in both large-scale visual recognition [21] and challenging FGVC [24,32].…”

Section: Introduction (mentioning)

confidence: 99%

Towards Faster Training of Global Covariance Pooling Networks by Iterative Matrix Square Root Normalization

Xie, Wang, et al. 2018

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

272

288


Global covariance pooling in convolutional neural networks has achieved impressive improvement over the classical first-order pooling. Recent works have shown matrix square root normalization plays a central role in achieving state-of-the-art performance. However, existing methods depend heavily on eigendecomposition (EIG) or singular value decomposition (SVD), suffering from inefficient training due to limited support of EIG and SVD on GPU. Towards addressing this problem, we propose an iterative matrix square root normalization method for fast end-to-end training of global covariance pooling networks. At the core of our method is a meta-layer designed with loop-embedded directed graph structure. The meta-layer consists of three consecutive nonlinear structured layers, which perform pre-normalization, coupled matrix iteration and post-compensation, respectively. Our method is much faster than EIG or SVD based ones, since it involves only matrix multiplications, suitable for parallel implementation on GPU. Moreover, the proposed network with ResNet architecture can converge in much less epochs, further accelerating network training. On large-scale ImageNet, we achieve competitive performance superior to existing counterparts. By finetuning our models pre-trained on ImageNet, we establish state-of-the-art results on three challenging fine-grained benchmarks. The source code and network models will be available at http://www.peihuali.org/iSQRT-COV.
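The three steps named in the abstract (pre-normalization, coupled matrix iteration, post-compensation) can be sketched roughly as follows. This is a minimal illustration and not the authors' released code: it assumes trace-based pre-normalization and a Newton-Schulz style coupled iteration, neither of which the abstract spells out. The names `matmul` and `isqrt_normalize` and the default iteration count are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def isqrt_normalize(cov, num_iters=10):
    """Approximate the square root of a symmetric positive definite matrix
    using only matrix multiplications (no EIG or SVD).

    Assumed scheme: divide by the trace, run a coupled Newton-Schulz
    iteration, then rescale by sqrt(trace).
    """
    n = len(cov)

    # Pre-normalization (assumed: by trace) so the iteration converges.
    trace = sum(cov[i][i] for i in range(n))
    y = [[v / trace for v in row] for row in cov]
    z = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    # Coupled iteration: y tends to sqrt(A), z tends to the inverse of sqrt(A).
    for _ in range(num_iters):
        zy = matmul(z, y)
        t = [[0.5 * ((3.0 if i == j else 0.0) - zy[i][j]) for j in range(n)]
             for i in range(n)]
        y, z = matmul(y, t), matmul(t, z)

    # Post-compensation: undo the scaling introduced by pre-normalization.
    scale = trace ** 0.5
    return [[scale * v for v in row] for row in y]


if __name__ == "__main__":
    cov = [[4.0, 1.0], [1.0, 3.0]]
    root = isqrt_normalize(cov)
    # Multiplying the result by itself should approximately recover cov.
    print(matmul(root, root))
```

Because every step is a matrix product or an elementwise scaling, the same structure maps directly onto batched GPU matrix multiplications, which is the efficiency argument the abstract makes.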


“…However, these methods only model the appearance feature of each frame independently while ignore the dynamics between frames, which results in inferior performance when recognizing temporal-related videos. To handle the mentioned drawback, two-stream based methods [10,33,36,3,9] are introduced by modeling appearance and dynamics separately with two networks and fuse two streams through middle or at last. Among these methods, Simonyan et al [22] first proposed the two-stream ConvNet architecture with both spatial and temporal networks.…”

Section: Related Work (mentioning)

confidence: 99%

“…The existing methods for action recognition can be summarized into two categories. The first type is based on two-stream neural networks [10,33,36,9], which consists of an RGB stream with RGB frames as input and a flow stream with optical flow as input. The spatial stream models the appearance features (not spatiotemporal features) without considering the temporal information.…”

Section: Introduction (mentioning)

confidence: 99%

STM: SpatioTemporal and Motion Encoding for Action Recognition

Jiang, Wang, Gan, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

423

239


Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network by introducing very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together.


“…With the advent of deep learning, neural networks are recently employed for action recognition due to its powerful ability in learning robust feature representations [3]-[6], [20]-[27]. The two-stream architecture [3] is a pioneer work to employ deep convolutional network for action recognition in videos, and has become a backbone of many other approaches [4]-[6], [25], [27]-[29]. To address the aggregation of spatial and temporal features, [6] explored different score fusion schemes.…”

Section: Related Work, A. RGB-based Action Recognition (mentioning)

confidence: 99%

Modality Compensation Network: Cross-Modal Adaptation for Action Recognition

Song, Liu, et al. 2020

IEEE Trans. on Image Process.


With the prevalence of RGB-D cameras, multimodal video data have become more available for human action recognition. One main challenge for this task lies in how to effectively leverage their complementary information. In this work, we propose a Modality Compensation Network (MCN) to explore the relationships of different modalities, and boost the representations for human action recognition. We regard RGB/optical flow videos as source modalities, skeletons as auxiliary modality. Our goal is to extract more discriminative features from source modalities, with the help of auxiliary modality. Built on deep Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM) networks, our model bridges data from source and auxiliary modalities by a modality adaptation block to achieve adaptive representation learning, that the network learns to compensate for the loss of skeletons at test time and even at training time. We explore multiple adaptation schemes to narrow the distance between source and auxiliary modal distributions from different levels, according to the alignment of source and auxiliary data in training. In addition, skeletons are only required in the training phase. Our model is able to improve the recognition performance with source data when testing. Experimental results reveal that MCN outperforms state-of-the-art approaches on four widely-used action recognition benchmarks.
