Abstract: Human parsing and pose estimation have recently received considerable interest due to their substantial application potential. However, the existing datasets have limited numbers of images and annotations and lack a variety of human appearances and coverage of challenging cases in unconstrained environments. In this paper, we introduce a new benchmark named "Look into Person (LIP)" that provides a significant advancement in terms of scalability, diversity, and difficulty, which are crucial for future developm…
“…We use the state-of-the-art human parsing model CE2P [23] to predict the human part label maps for all the images in the three benchmarks in advance. The CE2P model is trained on the Look Into Person [18] (LIP) dataset, which consists of ∼30,000 finely annotated images with 20 semantic labels (19 human parts and 1 background). We divide the 20 semantic categories into K groups, and train the CE2P model with the grouped labels.…”
Section: Implementation Details (mentioning)
confidence: 99%
“…E.g. the label set in [18]: background, hat, hair, glove, sunglasses, upper-clothes, dress, coat, socks, pants, jumpsuits, scarf, skirt, face, right-arm, left-arm, right-leg, left-leg, right-shoe and left-shoe.…”
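The grouping step described in the quoted passage can be made concrete with a small remapping routine. The sketch below maps the 20 LIP labels listed above to K coarser groups via a lookup table; the particular K = 5 assignment and the function names are assumptions chosen for illustration, since the quoted text does not say which labels fall into which group.

```python
import numpy as np

# The 20 LIP semantic labels, in their conventional 0..19 order.
LIP_LABELS = [
    "background", "hat", "hair", "glove", "sunglasses", "upper-clothes", "dress",
    "coat", "socks", "pants", "jumpsuits", "scarf", "skirt", "face",
    "right-arm", "left-arm", "right-leg", "left-leg", "right-shoe", "left-shoe",
]

# Hypothetical K = 5 grouping (background / head / upper body / lower body / limbs);
# the actual grouping used with CE2P is not specified in the quoted text.
GROUPS = {
    "background": 0,
    "hat": 1, "hair": 1, "sunglasses": 1, "face": 1, "scarf": 1,
    "upper-clothes": 2, "dress": 2, "coat": 2, "jumpsuits": 2,
    "pants": 3, "skirt": 3, "socks": 3, "right-shoe": 3, "left-shoe": 3,
    "glove": 4, "right-arm": 4, "left-arm": 4, "right-leg": 4, "left-leg": 4,
}

# Lookup table: original label id (0..19) -> group id (0..K-1).
lut = np.array([GROUPS[name] for name in LIP_LABELS], dtype=np.int64)

def regroup(label_map: np.ndarray) -> np.ndarray:
    """Remap a per-pixel LIP label map (integer values 0..19) to grouped labels."""
    return lut[label_map]
```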
Person re-identification is a challenging task due to various complex factors. Recent studies have attempted to integrate human parsing results or externally defined attributes to help capture human parts or important object regions. However, many useful contextual cues still fall outside the scope of predefined human parts or attributes. In this paper, we address these missed contextual cues by exploiting both the accurate human parts and the coarse non-human parts. In our implementation, we apply a human parsing model to extract the binary human part masks and a self-attention mechanism to capture the soft latent (non-human) part masks. We verify the effectiveness of our approach with new state-of-the-art performance on three challenging benchmarks: Market-1501, DukeMTMC-reID and CUHK03. Our implementation is available at https://github.com/ggjy/P2Net.pytorch.
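The repository linked above contains the authoritative implementation; the snippet below is only a minimal sketch of the idea stated in the abstract, namely pooling a backbone feature map with binary human-part masks plus one soft attention mask covering latent non-human regions. The function name, tensor shapes, and the weighted-average pooling are assumptions for illustration, not the actual P2Net code.

```python
import torch

def part_pooled_features(feat, part_masks, soft_mask):
    """Pool features under binary human-part masks and one soft (non-human) mask.

    feat:       (B, C, H, W) backbone feature map
    part_masks: (B, K, H, W) binary {0, 1} masks from a human parsing model
    soft_mask:  (B, 1, H, W) soft attention map in [0, 1] for latent non-human parts
    returns:    (B, K + 1, C) one pooled descriptor per mask
    """
    masks = torch.cat([part_masks.float(), soft_mask], dim=1)             # (B, K+1, H, W)
    weights = masks / masks.sum(dim=(2, 3), keepdim=True).clamp(min=1e-6)
    # Weighted average of the feature map under each mask.
    return torch.einsum("bkhw,bchw->bkc", weights, feat)
```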
“…To deal with the BG shift problem, one possible solution is to completely remove BGs using the binary body mask obtained by semantic segmentation or human parsing methods. Currently, methods such as Mask-RCNN [13] and JPP-Net [25] can obtain body masks with models pre-trained on large-scale datasets, e.g., MS COCO [26] and LIP [25]. However, masks obtained by these methods often contain errors due to factors such as low-resolution person images and highly dynamic person poses.…”
Section: Related Work (mentioning)
confidence: 99%
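For intuition, the sketch below spells out the "hard" removal the quote refers to: every pixel the parser labels as background is zeroed out, so any mask error deletes foreground pixels as well. The assumption that label 0 denotes background follows the LIP label set; the function itself is illustrative and not taken from any of the cited methods.

```python
import numpy as np

def remove_background(image: np.ndarray, label_map: np.ndarray) -> np.ndarray:
    """Hard background removal with a binary body mask from a parsing model.

    image:     (H, W, 3) person image
    label_map: (H, W) integer per-pixel labels (assumption: 0 = background)
    """
    body_mask = (label_map != 0).astype(image.dtype)  # binary foreground mask
    return image * body_mask[..., None]               # broadcast over the channel axis
```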
“…L2 distance is applied to minimize the loss. The JPPNet [25] is employed to extract M(I_Ds). We find that masks obtained by JPPNet often contain segmentation errors.…”
Section: Objective Functions in SBSGAN (mentioning)
confidence: 99%
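A minimal sketch of a mask-restricted L2 objective of the kind described in the quote is given below; the function name, tensor shapes, and normalization are assumptions made for illustration. Note that any segmentation error in the body mask enters this loss directly, which is exactly the weakness the quoted passage points out.

```python
import torch

def masked_l2_loss(generated, target, body_mask):
    """L2 distance between generated and target images inside the body mask only.

    generated, target: (B, 3, H, W) images
    body_mask:         (B, 1, H, W) binary mask from a human parsing model (1 = person)
    """
    diff = (generated - target) ** 2 * body_mask
    # Normalize by the number of foreground pixels so mask size does not bias the loss.
    return diff.sum() / body_mask.sum().clamp(min=1.0)
```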
“…One possible solution to sort out BG shift is to directly remove BGs using foreground (FG) masks in a hard manner (i.e., applying the binary masks on original images) [9,17,30,33]. However, it is observed that methods such as JPPNet [25] and Mask-RCNN [1,13], when applied to remove BGs, may damage the FG information. By simply removing BGs, this hard solution does improve the performance of cross-domain person re-ID to a certain extent (see Table 2).…”
Cross-domain person re-identification (re-ID) is challenging due to the bias between training and testing domains. We observe that if the backgrounds in the training and testing datasets are very different, it becomes much harder to extract robust pedestrian features, which compromises cross-domain person re-ID performance. In this paper, we formulate this as a background shift problem. A Suppression of Background Shift Generative Adversarial Network (SBSGAN) is proposed to generate images with suppressed backgrounds. Unlike simply removing backgrounds using binary masks, SBSGAN allows the generator to decide whether pixels should be preserved or suppressed, reducing segmentation errors caused by noisy foreground masks. Additionally, we take ID-related cues, such as vehicles and companions, into consideration. With high-quality generated images, a Densely Associated 2-Stream (DA-2S) network is introduced with Inter Stream Densely Connection (ISDC) modules to strengthen the complementarity of the generated data and ID-related cues. The experiments show that the proposed method achieves competitive performance on three re-ID datasets, i.e., Market-1501, DukeMTMC-reID, and CUHK03, under the cross-domain person re-ID scenario.
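To make the contrast in the abstract concrete, the sketch below compares hard masking with a soft per-pixel suppression weight that decides how much of each pixel to preserve. It is a schematic illustration of the general idea only; the actual SBSGAN generator, its adversarial losses, and the DA-2S/ISDC architecture are not reproduced here.

```python
import torch

def hard_masking(image, binary_mask):
    """Hard removal: background pixels are zeroed, so mask errors also erase FG detail."""
    return image * binary_mask

def soft_suppression(image, soft_mask, neutral_value=0.5):
    """Soft suppression: blend each pixel toward a neutral value by a weight in [0, 1].

    image:     (B, 3, H, W) image with values in [0, 1]
    soft_mask: (B, 1, H, W) per-pixel preservation weight (e.g. from a generator)
    Low weights fade a pixel instead of deleting it, which is more forgiving of
    noisy foreground masks than hard masking.
    """
    neutral = torch.full_like(image, neutral_value)
    return soft_mask * image + (1.0 - soft_mask) * neutral
```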