“…Since the human body contains highly structured information, many previous methods enhance pixel-level representations with well-designed architectures that capture global context cues, such as global context embeddings [4], [26], generative adversarial networks [25], [27], and recurrent models [28], [29]. Beyond pixel-level semantics, human part classes naturally carry rich structural semantics; hence, many works model body part correlations explicitly by building, e.g., graph neural networks [30], [31], tree-like topology information-passing architectures [32], [33], and hierarchical human structures [34], [35], [36]. Another direction exploits common semantics shared across different human-centric tasks, e.g., pose estimation and keypoint detection [6], [37], [38], [39], [40], [41], or other prior human semantics, e.g., edge information or human contour [30], [42].…”