2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.751

Unsupervised Learning of Long-Term Motion Dynamics for Videos

Abstract: We present an unsupervised representation learning approach that compactly encodes the motion dependencies in videos. Given a pair of images from a video clip, our framework learns to predict the long-term 3D motions. To reduce the complexity of the learning framework, we propose to describe the motion as a sequence of atomic 3D flows computed with RGB-D modality. We use a Recurrent Neural Network based Encoder-Decoder framework to predict these sequences of flows. We argue that in order for the decoder to rec…
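The abstract describes an encoder-decoder recurrence that maps a pair of RGB-D frames to a sequence of coarse 3D flow maps. Below is a minimal PyTorch sketch of that idea; the `FlowSeq2Seq` class, layer sizes, grid resolution, and step count are all illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class FlowSeq2Seq(nn.Module):
    """Toy encoder-decoder: encode a pair of RGB-D frames (8 channels),
    then unroll an LSTM decoder that emits T coarse 3D-flow maps
    (3 channels each). Shapes are illustrative, not from the paper."""

    def __init__(self, hidden=256, steps=8, grid=14):
        super().__init__()
        self.steps, self.grid = steps, grid
        self.encoder = nn.Sequential(               # 2 RGB-D frames -> feature vector
            nn.Conv2d(8, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        self.decoder = nn.LSTMCell(hidden, hidden)   # recurrent flow decoder
        self.to_flow = nn.Linear(hidden, 3 * grid * grid)

    def forward(self, frame_pair):                   # (B, 8, H, W)
        z = self.encoder(frame_pair)
        h, c = torch.zeros_like(z), torch.zeros_like(z)
        flows = []
        for _ in range(self.steps):                  # one atomic flow per step
            h, c = self.decoder(z, (h, c))
            flows.append(self.to_flow(h).view(-1, 3, self.grid, self.grid))
        return torch.stack(flows, dim=1)             # (B, T, 3, grid, grid)

model = FlowSeq2Seq()
pair = torch.randn(2, 8, 112, 112)                   # batch of 2 RGB-D frame pairs
print(model(pair).shape)                             # torch.Size([2, 8, 3, 14, 14])
```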

Cited by 187 publications (128 citation statements)
References 69 publications
“…Interestingly, they still find it useful to apply their model to multiple optical flow fields and fuse the results with the RGB stream. Some other works use recurrent approaches to model the actions in video [8,18,19,17] or even a single CNN [11]. Donahue et al [8] propose the Long-term Recurrent Convolutional Networks model that combines the CNN features from multiple frames using an LSTM to recognize actions.…”
Section: Action Recognition
confidence: 99%
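For context on the LRCN-style pipeline this statement cites (Donahue et al. [8]), here is a minimal sketch of a shared per-frame CNN whose features are aggregated by an LSTM for clip-level action classification. The `CnnLstmClassifier` name and all layer sizes are assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    """LRCN-style pipeline: a shared CNN embeds each frame, an LSTM
    aggregates the sequence, and the final hidden state is classified."""

    def __init__(self, num_classes=60, feat=128, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(                    # shared per-frame encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat),
        )
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                         # (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(feats)                 # h: (1, B, hidden)
        return self.head(h[-1])                      # per-clip class logits

logits = CnnLstmClassifier()(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)                                  # torch.Size([2, 60])
```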
“…Table 5 compares with state-of-the-art methods on the NTU-RGB+D dataset [37], reporting Cross-Subject and Cross-View accuracy; the best results and configuration are marked in bold:

Method                         Cross-Subject   Cross-View
HON4D [79]                     30.56%          7.26%
Super Normal Vector [80]       31.82%          13.61%
Joint Angles + HOG2 [81]       32.24%          22.27%
Skeletal Quads [72]            38.62%          41.36%
Shuffle and Learn [82]         47.50%          N/A
Histograms of Key Poses [83]   48.90%          57.70%
Lie Group [27]                 50.08%          52.76%
Rolling Rotations [84]         52.10%          53.40%
H-RNN [44] (reported in [37])  59.07%          63.79%
P-LSTM [37]                    62.93%          70.27%
Long-Term Motion [85]          66.22%          N/A
Spatio-temporal LSTM [46]      69.20%          77.70%
Our best configuration         73.40%          80.40%
…”
Section: Methods
confidence: 99%
“…However, they are prone to converging to blurry results, as they compute an average of all possible future outcomes for the same starting frame. In [19,13,8], future motion is first predicted as either an optical flow field or a set of filters, and the corresponding spatial transformation is then applied to history frames to produce future frames. The result is sharp but lacks diversity.…”
Section: Related Work
confidence: 99%
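The flow-then-warp scheme this statement describes (predict motion, then apply it as a spatial transformation to history frames) can be illustrated with bilinear backward warping via `torch.nn.functional.grid_sample`. The `warp_with_flow` helper and the random flow below are purely hypothetical.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """Backward-warp a frame by a dense 2D flow field with bilinear
    sampling, mirroring the flow-then-transform scheme in the quote.
    frame: (B, C, H, W); flow: (B, 2, H, W) in pixel units."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().expand(b, -1, -1, -1)  # (B,2,H,W)
    coords = base + flow                             # where each output pixel samples from
    # normalize pixel coordinates to [-1, 1] as grid_sample expects
    coords[:, 0] = 2 * coords[:, 0] / (w - 1) - 1
    coords[:, 1] = 2 * coords[:, 1] / (h - 1) - 1
    grid = coords.permute(0, 2, 3, 1)                # (B, H, W, 2), last dim = (x, y)
    return F.grid_sample(frame, grid, align_corners=True)

past = torch.randn(1, 3, 64, 64)                     # a history frame
flow = torch.randn(1, 2, 64, 64)                     # a (dummy) predicted flow
future = warp_with_flow(past, flow)
print(future.shape)                                  # torch.Size([1, 3, 64, 64])
```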