2018
DOI: 10.1007/978-3-030-01228-1_31

Mutual Learning to Adapt for Joint Human Parsing and Pose Estimation

Cited by 140 publications (81 citation statements). References 24 publications.

“…LIP [22]: We compare our method with 11 state-of-the-art methods on the LIP val set in Table 1. Our method achieves a large boost in average IoU (4.64% better than the second-best method, CE2P [56], and 8.4% better than the third-best, MuLA [54]). To verify its effectiveness in detail, we report per-class IoU in Table 2.…”
Section: Quantitative Results
confidence: 99%
“…The aforementioned deep human parsers generally achieve promising results, due to the strong learning power of neural networks [46,4] and the plentiful availability of annotated data [22,71]. However, they typically need to pre-segment images into superpixels [40,41], which breaks the end-to-end pipeline and is time-consuming, or they rely on extra human landmarks [72,22,71,14,54], requiring additional annotations or pre-trained pose estimators. Though [81] also performs multi-level, fine-grained parsing, it neither explores different information flows within human hierarchies nor models the problem from the view of multi-source information fusion.…”
Section: Related Work
confidence: 99%
“…In particular, we first apply a 1×1 convolution parameterized by V to f_{t+1}. Then we apply k_t in a dynamic convolution layer [21], which works like a traditional convolution layer except that the pre-learned static kernels are replaced with the dynamically predicted ones. Finally, we adopt another 1×1 convolution with U to produce h_{t+1}.…”
Section: Network Architecture
confidence: 99%
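
The three-step block quoted above (1×1 conv V, dynamic convolution with k_t, 1×1 conv U) is straightforward to sketch in code. Below is a minimal PyTorch sketch, not the authors' implementation: the class name DynamicConvBlock, the channel sizes, and the 3×3 dynamic-kernel shape are illustrative assumptions. The per-sample dynamic convolution is realized with a grouped conv2d, folding the batch into the channel axis so that each sample is convolved with its own predicted kernel.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvBlock(nn.Module):
    # Hypothetical sketch of the quoted block: 1x1 conv (V), dynamic
    # convolution with run-time kernels k_t, then 1x1 conv (U).
    # All channel sizes and the 3x3 kernel shape are assumptions.
    def __init__(self, in_ch=64, mid_ch=64, out_ch=64, ksize=3):
        super().__init__()
        self.V = nn.Conv2d(in_ch, mid_ch, kernel_size=1)   # 1x1 conv V
        self.U = nn.Conv2d(mid_ch, out_ch, kernel_size=1)  # 1x1 conv U
        self.mid_ch, self.ksize = mid_ch, ksize

    def forward(self, f_next, k_t):
        # f_next: features f_{t+1}, shape (B, in_ch, H, W)
        # k_t: predicted kernels, shape (B, mid_ch, mid_ch, ksize, ksize)
        x = self.V(f_next)
        B, C, H, W = x.shape
        # Grouped-conv trick: fold the batch into the channel axis so
        # every sample is convolved with its own dynamic kernel.
        x = x.reshape(1, B * C, H, W)
        w = k_t.reshape(B * self.mid_ch, C, self.ksize, self.ksize)
        x = F.conv2d(x, w, padding=self.ksize // 2, groups=B)
        x = x.reshape(B, self.mid_ch, H, W)
        return self.U(x)  # h_{t+1}

A quick shape check, with k standing in for the output of a (hypothetical) kernel-prediction branch:

block = DynamicConvBlock()
f = torch.randn(2, 64, 32, 32)
k = torch.randn(2, 64, 64, 3, 3)
h = block(f, k)  # -> (2, 64, 32, 32)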