“…The aforementioned deep human parsers generally achieve promising results, due to the strong learning power of neural networks [46,4] and the plentiful availability of annotated data [22,71]. However, they typically need to pre-segment images into superpixels [40,41], which breaks the end-to-end story and is time-consuming, or rely on extra human landmarks [72,22,71,14,54], requiring additional annotations or pre-trained pose estimators. Though [81] also performs multi-level, fine-grained parsing, it neither explores different information flows within human hierarchies nor models the problem from the view of multi-source information fusion.…”