2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.226
Spatiotemporal Pyramid Network for Video Action Recognition

Abstract: Two-stream convolutional networks have shown strong performance in video action recognition tasks. The key idea is to learn spatiotemporal features by fusing convolutional networks spatially and temporally. However, it remains unclear how to model the correlations between the spatial and temporal structures at multiple abstraction levels. First, the spatial stream tends to fail if two videos share similar backgrounds. Second, the temporal stream may be fooled if two actions resemble each other in short snippets, though a…

Cited by 239 publications (128 citation statements)
References 33 publications
“…Recently, researchers have shown increasing interest in exploring structured layers to enhance the representation capability of networks [12,25,1,22]. One particular kind of structured layer is concerned with global covariance pooling after the last convolution layer, which has shown impressive improvement over classical first-order pooling and has been successfully used in FGVC [25], visual question answering [15], and video action recognition [34]. Very recent works have demonstrated that matrix square root normalization of global covariance pooling plays a key role in achieving state-of-the-art performance in both large-scale visual recognition [21] and challenging FGVC [24,32].…”
Section: Introduction
confidence: 99%
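The global covariance pooling with matrix square root normalization mentioned in the excerpt above can be sketched roughly as follows. This is a minimal NumPy illustration of the general idea, not the cited papers' implementations; the function names, shapes, and the eigenvalue clipping are my own choices.

```python
import numpy as np

def covariance_pool(features):
    """Global covariance pooling over spatial positions.

    features: array of shape (C, N) -- C channels, N = H*W spatial positions
              from the last convolution layer.
    Returns a (C, C) sample covariance matrix (second-order statistics,
    in contrast to first-order average pooling).
    """
    mu = features.mean(axis=1, keepdims=True)
    centered = features - mu
    return centered @ centered.T / (features.shape[1] - 1)

def matrix_sqrt_normalize(cov, eps=1e-10):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(cov)
    w = np.clip(w, eps, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 49))        # e.g. 8 channels over a 7x7 map
cov = covariance_pool(feats)
pooled = matrix_sqrt_normalize(cov)         # normalized (C, C) representation
```

In practice the (C, C) result is flattened and fed to the classifier; the square root rescales the covariance spectrum, which is what the cited works identify as important for performance.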
“…However, these methods model only the appearance features of each frame independently while ignoring the dynamics between frames, which results in inferior performance when recognizing temporally related videos. To address this drawback, two-stream methods [10,33,36,3,9] model appearance and dynamics separately with two networks and fuse the two streams either at intermediate layers or at the end. Among these methods, Simonyan et al. [22] first proposed the two-stream ConvNet architecture with both spatial and temporal networks.…”
Section: Related Work
confidence: 99%
“…Existing methods for action recognition can be summarized into two categories. The first type is based on two-stream neural networks [10,33,36,9], which consist of an RGB stream with RGB frames as input and a flow stream with optical flow as input. The spatial stream models appearance features (not spatiotemporal features) without considering temporal information.…”
Section: Introduction
confidence: 99%
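The simplest way to combine the two streams described above is late fusion of per-stream class scores. The sketch below illustrates that idea only; the 1.5 weight on the flow stream and the example logits are hypothetical, not values taken from the cited papers.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D score vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def late_fusion(rgb_logits, flow_logits, w_flow=1.5):
    """Weighted average of per-stream softmax scores (late fusion).

    w_flow > 1 weights the temporal stream slightly more; this is an
    illustrative choice, not a value from the paper.
    """
    fused = softmax(rgb_logits) + w_flow * softmax(flow_logits)
    return fused / (1.0 + w_flow)            # renormalize to sum to 1

rgb = np.array([2.0, 0.5, -1.0])   # spatial-stream class scores (hypothetical)
flow = np.array([0.3, 2.2, -0.5])  # temporal-stream class scores (hypothetical)
fused = late_fusion(rgb, flow)
pred = int(np.argmax(fused))       # here the flow stream flips the decision
```

Mid-level fusion instead combines feature maps of the two networks at intermediate layers before the classifier, which is where the correlation-modeling question raised in the abstract arises.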
“…With the advent of deep learning, neural networks have recently been employed for action recognition due to their powerful ability to learn robust feature representations [3]-[6], [20]-[27]. The two-stream architecture [3] is a pioneering work employing deep convolutional networks for action recognition in videos, and has become the backbone of many other approaches [4]-[6], [25], [27]-[29]. To address the aggregation of spatial and temporal features, [6] explored different score fusion schemes.…”
Section: Related Work (A. RGB-Based Action Recognition)
confidence: 99%