Look into Person: Self-Supervised Structure-Sensitive Learning and a New Benchmark for Human Parsing

Gong, Ke; Liang, Xiaodan; Zhang, Dongyu; Shen, Xiaohui; Li, Lin

doi:10.1109/cvpr.2017.715

Cited by 480 publications

(349 citation statements)

References 36 publications

Supporting

Mentioning

338

Contrasting

Unclassified

“…2) We analyze three important sources of information, leading to a novel network architecture that conditionally incorporates direct, top-down, and bottom-up inferences. 3) Our model achieves state-of-the-art performances for comprehensive evaluations on four public datasets (LIP [22], PASCAL-Person-Part [71], ATR [39] and Fashion Clothing [49]). Testing with more than 20K images demonstrates the superiority over existing methods of exploiting compositional structural information for human parsing.…”

Section: Introduction (mentioning)

confidence: 94%

“…The aforementioned deep human parsers generally achieve promising results, due to the strong learning power of neural networks [46,4] and the plentiful availability of annotated data [22,71]. However, they typically need to pre-segment images into superpixels [40,41], which breaks the end-to-end story and is time-consuming, or rely on extra human landmarks [72,22,71,14,54], requiring additional annotations or pre-trained pose estimators. Though [81] also performs multi-level, fine-grained parsing, it neither explores different information flows within human hierarchies nor models the problem from the view of multi-source information fusion.…”

Section: Related Work (mentioning)

confidence: 99%

“…The random scale is set from 0.5 to 2.0, while the crop size is set to 473×473. For optimization, we adopt [22]. (Higher values are better.…”

Section: Implementation Details (mentioning)

confidence: 99%

“…Testing Phase: Following general protocol [76,54], we average the per-pixel classification scores at multiple scales with flipping, i.e., the scale is 0.5 to 1.5 (in increments of 0.25) times the original size. Our model does not require any other pre-/post-processing steps (i.e., over-segmentation [40,38], human pose [71], CRF [71]), and thus achieves a processing speed of 23.0fps, averaged on PASCAL-Person-Part, which is faster than previous deep human parsers, such as Joint [71] (0.1fps), Attention+SSL [22] (2.0fps), MMAN [50] (3.5fps) and MuLA [54] (15fps). Reproducibility: Our method is implemented on PyTorch and trained on four NVIDIA Tesla V100 GPUs with a 32GB memory per-card.…”

Section: Implementation Details (mentioning)

confidence: 99%
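
The testing protocol quoted above (averaging per-pixel classification scores over scales 0.5 to 1.5 in steps of 0.25, with horizontal flipping) can be sketched roughly as follows. This is a minimal illustration, not the cited authors' code: `multi_scale_flip_average`, `resize_nearest`, `hflip`, and `toy_predict` are hypothetical names, the image is a single-channel nested list, and nearest-neighbour resizing stands in for whatever interpolation the real implementation uses.

```python
from typing import Callable, List, Sequence

Plane = List[List[float]]   # H x W grid of values
Scores = List[Plane]        # C x H x W per-pixel class scores


def resize_nearest(plane: Plane, out_h: int, out_w: int) -> Plane:
    """Nearest-neighbour resize of one 2-D plane (assumed stand-in for real interpolation)."""
    in_h, in_w = len(plane), len(plane[0])
    return [
        [plane[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def hflip(plane: Plane) -> Plane:
    """Mirror a plane left-to-right."""
    return [row[::-1] for row in plane]


def multi_scale_flip_average(
    image: Plane,
    predict: Callable[[Plane], Scores],
    scales: Sequence[float] = (0.5, 0.75, 1.0, 1.25, 1.5),
) -> Scores:
    """Average class scores over several scales, each with and without flipping."""
    h, w = len(image), len(image[0])
    total = None
    n_passes = 0
    for s in scales:
        scaled = resize_nearest(image, max(1, round(h * s)), max(1, round(w * s)))
        for flipped in (False, True):
            scores = predict(hflip(scaled) if flipped else scaled)
            if flipped:
                # Undo the flip so the scores line up with the original image.
                scores = [hflip(c) for c in scores]
            # Bring every prediction back to the original resolution.
            scores = [resize_nearest(c, h, w) for c in scores]
            if total is None:
                total = scores
            else:
                total = [
                    [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)]
                    for ca, cb in zip(total, scores)
                ]
            n_passes += 1
    return [[[v / n_passes for v in row] for row in c] for c in total]


def toy_predict(image: Plane) -> Scores:
    """Hypothetical two-class 'model': class 0 score is the pixel, class 1 is its complement."""
    return [image, [[1.0 - v for v in row] for row in image]]


image = [[0.25] * 6 for _ in range(4)]
averaged = multi_scale_flip_average(image, toy_predict)
```

On datasets whose labels distinguish left and right body parts, implementations typically also swap the corresponding score channels after un-flipping; that step is omitted here for brevity.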

“…promising results, they fail to make full use of the rich structures in this task. Some others use extra human joints to better constrain body configurations [22,71,54], requiring additional training data of human keypoints and ignoring the compositional relations within human bodies.…”

Section: Introduction (mentioning)

confidence: 99%

Learning Compositional Neural Information Fusion for Human Parsing

Wang, Zhang, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

132

This work proposes to combine neural networks with the compositional hierarchy of human bodies for efficient and complete human parsing. We formulate the approach as a neural information fusion framework. Our model assembles the information from three inference processes over the hierarchy: direct inference (directly predicting each part of a human body using image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). The bottom-up and top-down inferences explicitly model the compositional and decompositional relations in human bodies, respectively. In addition, the fusion of multi-source information is conditioned on the inputs, i.e., by estimating and considering the confidence of the sources. The whole model is end-to-end differentiable, explicitly modeling information flows and structures. Our approach is extensively evaluated on four popular datasets, outperforming the state-of-the-arts in all cases, with a fast processing speed of 23fps. Our code and results have been released to help ease future research in this direction. * Equal contribution. † Corresponding author: Yanwei Pang.

Instance-Level Human Parsing via Part Grouping Network

Gong, Liang, et al. 2018

Lecture Notes in Computer Science

Self Cite

303

246

Instance-level human parsing towards real-world human analysis scenarios is still under-explored due to the absence of sufficient data resources and technical difficulty in parsing multiple instances in a single pass. Several related works all follow the "parsing-by-detection" pipeline that heavily relies on separately trained detection models to localize instances and then performs human parsing for each instance sequentially. Nonetheless, two discrepant optimization targets of detection and parsing lead to suboptimal representation learning and error accumulation for final results. In this work, we make the first attempt to explore a detection-free Part Grouping Network (PGN) for efficiently parsing multiple people in an image in a single pass. Our PGN reformulates instance-level human parsing as two twinned sub-tasks that can be jointly learned and mutually refined via a unified network: 1) semantic part segmentation for assigning each pixel as a human part (e.g., face, arms); 2) instance-aware edge detection to group semantic parts into distinct person instances. Thus the shared intermediate representation would be endowed with capabilities in both characterizing fine-grained parts and inferring instance belongings of each part. Finally, a simple instance partition process is employed to get final results during inference. We conducted experiments on PASCAL-Person-Part dataset and our PGN outperforms all state-of-the-art methods. Furthermore, we show its superiority on a newly collected multi-person parsing dataset (CIHP) including 38,280 diverse images, which is the largest dataset so far and can facilitate more advanced human analysis. The CIHP benchmark and our source code are available at http://sysu-hcp.net/lip/.