Pedestrian attribute recognition is an important task for intelligent video
surveillance. However, existing methods struggle to accurately localize
discriminative regions for each attribute. We propose Attribute Localization
Transformer (ALFormer), a novel framework to improve spatial localization
through two key components. First, we introduce Mask Contrast Learning
(MCL), which suppresses the relevance between regional features, forcing the
model to focus on the intrinsic spatial area of each attribute. Second, we
design an Attribute Spatial Memory (ASM) module that generates reliable
attention maps capturing the inherent location of each attribute. Extensive
experiments on two
benchmark datasets demonstrate that ALFormer achieves state-of-the-art
performance. Ablation studies and visualizations confirm that the proposed
modules improve attribute localization. Our work offers a simple yet
effective approach to exploiting spatial consistency for enhanced pedestrian
attribute recognition.
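As a rough illustration only: one way a mask-contrast objective of this kind might be sketched is to pool per-attribute features through attention maps, randomly mask spatial positions, and require the masked pooled features to still match their unmasked counterparts, so each attribute cannot lean on unrelated regions. The function name, masking scheme, and InfoNCE-style loss below are our assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def mask_contrast_loss(feat, attn, mask_ratio=0.3, tau=0.1):
    """Hypothetical mask-contrast sketch (not the paper's exact MCL).

    feat: (B, C, H, W) backbone feature map
    attn: (B, A, H, W) per-attribute attention logits (A attributes)
    """
    B, C, H, W = feat.shape
    f = feat.flatten(2)                        # (B, C, HW)
    w = attn.flatten(2).softmax(-1)            # (B, A, HW) spatial weights
    # Attention-pooled attribute features from the full feature map.
    full = torch.einsum('bah,bch->bac', w, f)  # (B, A, C)
    # Randomly drop spatial positions to suppress cross-region relevance.
    keep = (torch.rand(B, 1, H * W, device=feat.device) > mask_ratio).float()
    masked = torch.einsum('bah,bch->bac', w, f * keep)
    # Contrast: each masked attribute feature should match its own
    # unmasked counterpart, not those of other attributes.
    z1 = F.normalize(full, dim=-1)
    z2 = F.normalize(masked, dim=-1)
    logits = torch.einsum('bac,bdc->bad', z1, z2) / tau   # (B, A, A)
    target = torch.arange(z1.shape[1], device=feat.device).expand(B, -1)
    return F.cross_entropy(logits.flatten(0, 1), target.flatten())
```

With this kind of objective, the per-attribute attention maps (here standing in for an ASM-like module's output) are pushed to concentrate on regions that remain predictive even when other areas are masked out.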