Hierarchical Human Parsing With Typed Part-Relation Reasoning

Yang, Yi; Zhu, Hailong; Dai, Jifeng; Pang, Yanwei; Shen, Jianbing; Shao, Ling

doi:10.1109/cvpr42600.2020.00895

Cited by 110 publications

(57 citation statements)

References 46 publications

“…Differentiable attention mechanisms enable a neural network to focus more on relevant elements of the input than on irrelevant parts. With their popularity in the field of natural language processing [8,39,43,49,60], attention modeling is rapidly adopted in various computer vision tasks, such as image recognition [14,23,58,66,73], domain adaptation [67,83], human pose estimation [9,63,77], object detection [4] and image generation [76,81,86]. Further, co-attention mechanisms become an essential tool in many vision-language applications and sequential modeling tasks, such as visual question answering [41,44,75,78], visual dialog [74,84], vision-language navigation [68], and video segmentation [42,61], showing its effectiveness in capturing the underlying relations between different entities.…”

Section: Related Work (mentioning)

confidence: 99%

Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation

Sun

Yang

Dai

et al. 2020

Lecture Notes in Computer Science

Self Cite

This paper studies the problem of learning semantic segmentation from image-level supervision only. Current popular solutions leverage object localization maps from classifiers as supervision signals, and struggle to make the localization maps capture more complete object content. Rather than previous efforts that primarily focus on intra-image information, we address the value of cross-image semantic relations for comprehensive object pattern mining. To achieve this, two neural co-attentions are incorporated into the classifier to complementarily capture cross-image semantic similarities and differences. In particular, given a pair of training images, one co-attention enforces the classifier to recognize the common semantics from co-attentive objects, while the other one, called contrastive co-attention, drives the classifier to identify the unshared semantics from the remaining, uncommon objects. This helps the classifier discover more object patterns and better ground semantics in image regions. In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference, hence eventually benefiting semantic segmentation learning. More essentially, our algorithm provides a unified framework that handles different WSSS settings well, i.e., learning WSSS with (1) precise image-level supervision only, (2) extra simple single-label data, and (3) extra noisy web data. It sets a new state of the art in all these settings, demonstrating its efficacy and generalizability.
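As a rough illustration of the co-attention idea in the abstract above, the sketch below lets each location in one image gather features from a second image, weighted by feature similarity. It is a minimal sketch under stated assumptions, not the paper's implementation: the plain dot-product affinity (no learned weight matrix), the flat list-of-vectors feature layout, and the names `softmax` and `co_attend` are all hypothetical choices made for this example.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attend(feats_a: List[Vector], feats_b: List[Vector]) -> List[Vector]:
    """For every location in image A, return a similarity-weighted
    average of image B's features.

    Assumption: affinity is a plain dot product; a learned projection
    would normally sit between the two feature sets.
    """
    dim = len(feats_b[0])
    attended = []
    for fa in feats_a:
        # Affinity between this A-location and every B-location.
        affinity = [sum(x * y for x, y in zip(fa, fb)) for fb in feats_b]
        weights = softmax(affinity)
        # Weighted sum of B's features: the content shared with A.
        attended.append(
            [sum(w * fb[d] for w, fb in zip(weights, feats_b)) for d in range(dim)]
        )
    return attended
```

Locations in image B whose features resemble the query location dominate the weighted average, which is the sense in which the attended feature carries semantics common to both images.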

“…Fine-grained human semantic segmentation, as one of the central tasks in human understanding, has applications in human-centric vision [41,66,43], human-robot interaction [11] and fashion analysis [46]. However, previous studies mainly focus on category-level human parsing [32,15,12,35,47,63,64,49]; only very few human parsers are specifically designed for the instance-aware setting. As of to date, there exist two paradigms for instance-aware human parsing: top-down and bottom-up.…”

Section: Related Work (mentioning)

confidence: 99%

Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing

Zhou

Yang

Liu

et al. 2021

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner. It is a compact, efficient and powerful framework that exploits structural information over different human granularities and eases the difficulty of person partitioning. Specifically, a dense-to-sparse projection field, which allows explicitly associating dense human semantics with sparse keypoints, is learnt and progressively improved over the network feature pyramid for robustness. Then, the difficult pixel grouping problem is cast as an easier, multi-person joint assembling task. By formulating joint association as maximum-weight bipartite matching, a differentiable solution is developed to exploit projected gradient descent and Dykstra's cyclic projection algorithm. This makes our method end-to-end trainable and allows back-propagating the grouping error to directly supervise multi-granularity human representation learning. This is distinguished from current bottom-up human parsers or pose estimators which require sophisticated post-processing or heuristic greedy algorithms. Experiments on three instance-aware human parsing datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.
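To make the matching objective named in the abstract above concrete, the sketch below solves a tiny maximum-weight bipartite matching by exhaustive search. It only illustrates the objective; it is not the differentiable solver based on projected gradient descent and Dykstra's algorithm that the abstract describes. The function name `max_weight_matching`, the square weight matrix, and the reading of rows and columns as two sets of candidate joints are assumptions made for this example.

```python
from itertools import permutations
from typing import List, Tuple


def max_weight_matching(weights: List[List[float]]) -> Tuple[List[int], float]:
    """Brute-force maximum-weight bipartite matching.

    weights[i][j] is the affinity between left item i and right item j
    (assumed square). Returns the best assignment, where assignment[i]
    is the right item matched to left item i, plus its total weight.
    Only practical for very small inputs: the search is O(n!).
    """
    n = len(weights)
    best_assignment: Tuple[int, ...] = tuple(range(n))
    best_score = float("-inf")
    for perm in permutations(range(n)):
        # Each permutation is one one-to-one assignment of left to right.
        score = sum(weights[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score = score
            best_assignment = perm
    return list(best_assignment), best_score
```

Exhaustive search makes the one-to-one constraint explicit but does not scale and is not differentiable, which is the gap a relaxed, projection-based solver is meant to close.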

“…Other studies have combined additional human prior information for human parsing. For instance, Wang et al [32] assembled the compositional hierarchy of human bodies for efficient and complete human parsing, and Ji et al [19] exploited the intrinsic physiological structure of the human body by designing a novel semantic neural tree for human parsing. Utilizing grammar rules in a cascaded and parallel manner, Zhang et al [42] employed the inherent hierarchical structure of the human body and the relationship of different human parts to achieve impressive human parsing results.…”

Section: Related Work (mentioning)

confidence: 99%

“…Ji et al [19] designed a novel semantic neural tree to encode the physiological structure of the human body and achieved competitive results. Wang et al [31,32] exploited deep graph networks and hierarchical human structures to capture the relation information of human parts and obtained better performances. These mechanisms involve designing a complex semantic tree or message-passing network that leads to heavy computing complexity while improving performance.…”

Section: Introduction (mentioning)

confidence: 99%

CDGNet: Class Distribution Guided Network for Human Parsing

Liu

Choi

Wang

et al. 2021

Preprint

The objective of human parsing is to partition a human in an image into constituent parts. This task involves labeling each pixel of the human image according to the classes. Since the human body comprises hierarchically structured parts, each body part in an image has its own characteristic position distribution. For example, a head is unlikely to appear below the feet, and arms are likely to be near the torso. Inspired by this observation, we build instance class distributions by accumulating the original human parsing label in the horizontal and vertical directions, which can be used as supervision signals. Using these horizontal and vertical class distribution labels, the network is guided to exploit the intrinsic position distribution of each class. We combine two guided features to form a spatial guidance map, which is then superimposed onto the baseline network by multiplication and concatenation to distinguish the human parts precisely. We conducted extensive experiments to demonstrate the effectiveness and superiority of our method on three well-known benchmarks: LIP, ATR, and CIHP databases.
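The accumulation step described in the abstract above can be sketched as follows: a per-pixel label map is collapsed along each axis into per-class occupancy profiles. This is a minimal sketch under assumptions: the normalization by row and column length, the function name `class_distributions`, and the nested-list label layout are illustrative choices, and which of the two outputs corresponds to the paper's "horizontal" versus "vertical" distribution is not specified here.

```python
from typing import List, Tuple


def class_distributions(
    label: List[List[int]], num_classes: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Accumulate a parsing label map along each image axis.

    Returns (row_dist, col_dist) where
      row_dist[c][y] = fraction of pixels in row y labelled c
      col_dist[c][x] = fraction of pixels in column x labelled c
    Assumption: counts are normalized by row/column length.
    """
    height = len(label)
    width = len(label[0])
    row_dist = [[0.0] * height for _ in range(num_classes)]
    col_dist = [[0.0] * width for _ in range(num_classes)]
    for y, row in enumerate(label):
        for x, cls in enumerate(row):
            row_dist[cls][y] += 1.0 / width
            col_dist[cls][x] += 1.0 / height
    return row_dist, col_dist
```

On a real parsing label, the profile for a class such as "head" would concentrate in the upper rows, which is the positional prior the abstract proposes to use as an extra supervision signal.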
