Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation

Yu, Chen; Shen, Chunhua; Wei, Xiu-Shen; Liu, Lingqiao; Yang, Jian

doi:10.1109/iccv.2017.137

Cited by 312 publications

(180 citation statements)

References 27 publications

Supporting

Mentioning

179

Contrasting

Order By: Relevance

“…This process could increase the quality of predictions, since the generator is stimulated to produce more plausible predictions. Another application of GANs in that sense is to enforce the structural representation of the human body [12].…”

Section: D Pose Estimationmentioning

confidence: 99%

Multi-task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition

Luvizon

Picard

Tabia

2020

IEEE Trans. Pattern Anal. Mach. Intell.

View full text Add to dashboard Cite

Human pose estimation and action recognition are related tasks since both problems are strongly dependent on the human body representation and analysis. Nonetheless, most recent methods in the literature handle the two problems separately. In this work, we propose a multi-task framework for jointly estimating 2D or 3D human poses from monocular color images and classifying human actions from video sequences. We show that a single architecture can be used to solve both problems in an efficient way and still achieves state-of-the-art or comparable results at each task while running with a throughput of more than 100 frames per second. The proposed method benefits from high parameters sharing between the two tasks by unifying still images and video clips processing in a single pipeline, allowing the model to be trained with data from different categories simultaneously and in a seamlessly way. Additionally, we provide important insights for end-to-end training the proposed multi-task model by decoupling key prediction parts, which consistently leads to better accuracy on both tasks. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU RGB+D) demonstrate the effectiveness of our method on the targeted tasks. Our source code and trained weights are publicly available at

show abstract

Section: D Pose Estimationmentioning

confidence: 99%

Multi-task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition

Luvizon

Picard

Tabia

2020

IEEE Trans. Pattern Anal. Mach. Intell.

View full text Add to dashboard Cite

show abstract

“…GAN has been mainly used for generating images. One of the first work to apply adversarial training to improve structured output learning might be [22], where a discriminator loss is used to distinguish predicted pose and ground-truth pose for pose estimation from monocular images. Recently, GANs have also been adopted in depth estimation.…”

Section: Related Workmentioning

confidence: 99%

Exploiting Temporal Consistency for Real-Time Video Depth Estimation

Zhang

Cao

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

114

View full text Add to dashboard Cite

Accuracy of depth estimation from static images has been significantly improved recently, by exploiting hierarchical features from deep convolutional neural networks (CNNs). Compared with static images, vast information exists among video frames and can be exploited to improve the depth estimation performance. In this work, we focus on exploring temporal information from monocular videos for depth estimation. Specifically, we take the advantage of convolutional long short-term memory (CLSTM) and propose a novel spatial-temporal CSLTM (ST-CLSTM) structure. Our ST-CLSTM structure can capture not only the spatial features but also the temporal correlations/consistency among consecutive video frames with negligible increase in computational cost. Additionally, in order to maintain the temporal consistency among the estimated depth frames, we apply the generative adversarial learning scheme and design a temporal consistency loss. The temporal consistency loss is combined with the spatial loss to update the model in an end-to-end fashion. By taking advantage of the temporal information, we build a video depth estimation framework that runs in real-time and generates visually pleasant results. Moreover, our approach is flexible and can be generalized to most existing depth estimation frameworks. Code is available at:

show abstract

“…Temporally adversarial training To further leverage temporal cues, DKD adopts the adversarial training strategy to learn proper supervision in the temporal dimension for improving the pose kernel distillator. Adversarial training was only exploited for images in the spatial dimension in prior works [6,5]. In contrast, our proposed temporally adversarial training strategy aims to provide constraints for pose changes in the temporal dimension, helping estimate coherent human poses in consecutive frames of videos.…”

Section: Formulationmentioning

confidence: 99%

“…The temporally adversarial discriminator learns to distinguish the groundtruth change of joint confidence maps over neighboring frames from the predicted change, and thus supervises DKD to generate temporally coherent poses. In contrast to previous adversarial training methods [6,5] that learn structure priors in the spatial dimension for recognition over still images, our method constrains the pose variations in the temporal dimension of videos, enforcing plausible changes of estimated poses in videos. In addition, this discriminator can be removed during the inference phase, thus introducing no additional computation.…”

Section: Introductionmentioning

confidence: 99%

Dynamic Kernel Distillation for Efficient Pose Estimation in Videos

Nie

Luo

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

View full text Add to dashboard Cite

Existing video-based human pose estimation methods extensively apply large networks onto every frame in the video to localize body joints, which suffer high computational cost and hardly meet the low-latency requirement in realistic applications. To address this issue, we propose a novel Dynamic Kernel Distillation (DKD) model to facilitate small networks for estimating human poses in videos, thus significantly lifting the efficiency. In particular, DKD introduces a light-weight distillator to online distill pose kernels via leveraging temporal cues from the previous frame in a one-shot feed-forward manner. Then, DKD simplifies body joint localization into a matching procedure between the pose kernels and the current frame, which can be efficiently computed via simple convolution. In this way, DKD fast transfers pose knowledge from one frame to provide compact guidance for body joint localization in the following frame, which enables utilization of small networks in video-based pose estimation. To facilitate the training process, DKD exploits a temporally adversarial training strategy that introduces a temporal discriminator to help generate temporally coherent pose kernels and pose estimation results within a long range. Experiments on Penn Action and Sub-JHMDB benchmarks demonstrate outperforming efficiency of DKD, specifically, 10× flops reduction and 2× speedup over previous best model, and its state-of-the-art accuracy. * This work was partly done while Xuecheng was an intern as Snap Inc. Small CNN Pose Kernel Distillator Matching Frame t-1 Frame t Small CNN Matching Frame t+1 Small CNN Pose Kernel Distillator Matching (a) Our DKD Model RNN or Optical Flow Large CNN Classification Frame t-1 RNN or Optical Flow Large CNN Frame t Large CNN Frame t+1 Classification Classification (b) The Traditional Model

show abstract

Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation

Cited by 312 publications

References 27 publications

Multi-task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition

Multi-task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition

Exploiting Temporal Consistency for Real-Time Video Depth Estimation

Dynamic Kernel Distillation for Efficient Pose Estimation in Videos

Contact Info

Product

Resources

About