Unsupervised Learning of Long-Term Motion Dynamics for Videos

Luo, Zelun; Peng, Bingwei; Huang, De-An; Alahi, Alexandre; Li, Feifei

doi:10.1109/cvpr.2017.751

Cited by 187 publications

(128 citation statements)

References 69 publications

Supporting

Mentioning

127

Contrasting

Order By: Relevance

“…Interestingly, they still find it useful to apply their model to multiple optical flow fields and fuse the results with the RGB stream. Some other works use recurrent approaches to model the actions in video [8,18,19,17] or even a single CNN [11]. Donahue et al [8] propose the Long-term Recurrent Convolutional Networks model that combines the CNN features from multiple frames using an LSTM to recognize actions.…”

Section: Action Recognitionmentioning

confidence: 99%

Convolutional Relational Machine for Group Activity Recognition

Azar

Atigh

Nickabadi

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

118

View full text Add to dashboard Cite

We present an end-to-end deep Convolutional Neural Network called Convolutional Relational Machine (CRM) for recognizing group activities that utilizes the information in spatial relations between individual persons in image or video. It learns to produce an intermediate spatial representation (activity map) based on individual and group activities. A multi-stage refinement component is responsible for decreasing the incorrect predictions in the activity map. Finally, an aggregation component uses the refined information to recognize group activities. Experimental results demonstrate the constructive contribution of the information extracted and represented in the form of the activity map. CRM shows advantages over state-of-the-art models on Volleyball and Collective Activity datasets.

show abstract

Section: Action Recognitionmentioning

confidence: 99%

Convolutional Relational Machine for Group Activity Recognition

Azar

Atigh

Nickabadi

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

118

View full text Add to dashboard Cite

show abstract

“…Cross-Subject Cross-View HON4D [79] 30.56% 7.26% Super Normal Vector [80] 31.82% 13.61% Joint Angles + HOG2 [81] 32.24% 22.27% Skeletal Quads [72] 38.62% 41.36% Shuffle and Learn [82] 47.50% N/A Histograms of Key Poses [83] 48.90% 57.70% Lie Group [27] 50.08% 52.76% Rolling Rotations [84] 52.10% 53.40% H-RNN [44] (reported in [37]) 59.07% 63.79% P-LSTM [37] 62.93% 70.27% Long-Term Motion [85] 66.22% N/A Spatio-temporal LSTM [46] 69.20% 77.70% Our best configuration 73.40% 80.40% Table 5 Comparison with state-of-the-art methods on NTU-RGB+D dataset [37]. The best results and configuration are marked in bold.…”

Section: Methodsmentioning

confidence: 99%

Learning to recognise 3D human action from a new skeleton‐based representation using deep convolutional neural networks

Pham

Khoudour

Crouzil

et al. 2019

IET Computer Vision

View full text Add to dashboard Cite

Recognizing human actions in untrimmed videos is an important challenging task. An effective 3D motion representation and a powerful learning model are two key factors influencing recognition performance. In this paper we introduce a new skeletonbased representation for 3D action recognition in videos. The key idea of the proposed representation is to transform 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a color encoding process. By normalizing the 3D joint coordinates and dividing each skeleton frame into five parts, where the joints are concatenated according to the order of their physical connections, the color-coded representation is able to represent spatio-temporal evolutions of complex 3D motions, independently of the length of each sequence. We then design and train different Deep Convolutional Neural Networks (D-CNNs) based on the Residual Network architecture (ResNet) on the obtained image-based representations to learn 3D motion features and classify them into classes. Our method is evaluated on two widely used action recognition benchmarks: MSR Action3D and NTU-RGB+D, a very large-scale dataset for 3D human action recognition. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches whilst requiring less computation for training and prediction.

show abstract

“…However, they are prone to converging to blurry results as they compute an average of all possible future outcomes for the same starting frame. In [19,13,8], future motion is predicted using either optical flow or filter, where estimation and then corresponding spatial transformation is applied to history frames to produce future frames. The result is sharp but lacks diversity.…”

Section: Related Workmentioning

confidence: 99%

Video Generation From Single Semantic Label Map

Pan

Wang²,

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

View full text Add to dashboard Cite

This paper proposes the novel task of video generation conditioned on a SINGLE semantic label map, which provides a good balance between flexibility and quality in the generation process. Different from typical end-to-end approaches, which model both scene content and dynamics in a single step, we propose to decompose this difficult task into two sub-problems. As current image generation methods do better than video generation in terms of detail, we synthesize high quality content by only generating the first frame. Then we animate the scene based on its semantic meaning to obtain temporally coherent video, giving us excellent results overall. We employ a cVAE for predicting optical flow as a beneficial intermediate step to generate a video sequence conditioned on the initial single frame. A semantic label map is integrated into the flow prediction module to achieve major improvements in the image-to-video generation process. Extensive experiments on the Cityscapes dataset show that our method outperforms all competing methods. The source code will be released on https://github.com/junting/seg2vid.

show abstract

Unsupervised Learning of Long-Term Motion Dynamics for Videos

Cited by 187 publications

References 69 publications

Convolutional Relational Machine for Group Activity Recognition

Convolutional Relational Machine for Group Activity Recognition

Learning to recognise 3D human action from a new skeleton‐based representation using deep convolutional neural networks

Video Generation From Single Semantic Label Map

Contact Info

Product

Resources

About