Pedestrian attribute recognition is an important problem in computer vision and plays a special role in video surveillance. Previous methods for this problem are mainly based on multi-label, end-to-end deep neural networks. These methods neglect to use attributes to define local feature areas, and they suffer from problems caused by the presence of the bounding box. A new framework that jointly performs human semantic parsing and pedestrian attribute recognition is proposed to achieve effective attribute recognition. By extracting human parts via semantic parsing, both semantic and spatial information can be explored while the background is eliminated. The framework also uses multi-scale features to capture rich details and contextual information through the proposed attribute recognition-bidirectional feature pyramid network. Because the baseline network has a significant impact on performance, EfficientNet-B3 is selected from the EfficientNet family as the baseline network; it provides an appropriate trade-off between the three factors of CNN scaling (depth, width and resolution). Finally, the proposed framework is tested on the PETA, RAP and PA-100k datasets. Experimental results show that our method has superior performance in both mean accuracy and instance-based metrics compared to state-of-the-art results.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
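The abstract reports results in mean accuracy and instance-based metrics without defining them. The sketch below shows how these quantities are commonly computed in the pedestrian attribute recognition literature, which is an assumption about the evaluation protocol rather than a statement of the authors' exact code. The function names `mean_accuracy` and `instance_metrics`, the 0/1 list-of-lists input format, and the choice to score an empty denominator as 0 are all hypothetical and chosen for illustration.

```python
def mean_accuracy(y_true, y_pred):
    """Label-based mA: average over attributes of (TPR + TNR) / 2.

    y_true, y_pred: lists of rows, one row per pedestrian image,
    each row a list of 0/1 attribute labels (assumed format).
    """
    num_attrs = len(y_true[0])
    total = 0.0
    for j in range(num_attrs):
        col_true = [row[j] for row in y_true]
        col_pred = [row[j] for row in y_pred]
        pos = sum(col_true)
        neg = len(col_true) - pos
        tp = sum(1 for t, p in zip(col_true, col_pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(col_true, col_pred) if t == 0 and p == 0)
        # Assumption: an attribute with no positives (or no negatives)
        # contributes 0 for that rate instead of raising an error.
        tpr = tp / pos if pos else 0.0
        tnr = tn / neg if neg else 0.0
        total += 0.5 * (tpr + tnr)
    return total / num_attrs


def instance_metrics(y_true, y_pred):
    """Instance-based accuracy, precision, recall and F1.

    Each quantity is computed per image from the sets of true and
    predicted positive attributes, then averaged over images.
    """
    acc = prec = rec = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(t_row, p_row) if t == 1 and p == 1)
        union = sum(1 for t, p in zip(t_row, p_row) if t == 1 or p == 1)
        n_true = sum(t_row)
        n_pred = sum(p_row)
        acc += inter / union if union else 0.0
        prec += inter / n_pred if n_pred else 0.0
        rec += inter / n_true if n_true else 0.0
    n = len(y_true)
    acc, prec, rec = acc / n, prec / n, rec / n
    # F1 is taken from the averaged precision and recall.
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}


if __name__ == "__main__":
    # Toy labels for 4 images and 3 attributes (made up for illustration).
    y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
    y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 1, 1]]
    print(mean_accuracy(y_true, y_pred))
    print(instance_metrics(y_true, y_pred))
```

Mean accuracy averages the positive and negative recall of each attribute, so rare attributes count as much as frequent ones. The instance-based metrics instead measure how consistent the full set of predicted attributes is for each image.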