Currently, supervised person re-identification (Re-ID) models trained on labeled datasets can achieve high recognition performance within the same data domain. However, accuracy drops dramatically when these models are directly applied to other unlabeled datasets or natural environments, due to a significant gap in sample distribution between the two domains. Unsupervised Domain Adaptation (UDA) methods can alleviate this problem by fine-tuning the model on the target dataset with pseudo-labels generated by clustering. However, these methods primarily target image-based person Re-ID. In video scenarios, background noise and interference are complex and changeable, resulting in large intra-class distances and small inter-class distances, which easily lead to noisy labels. The large domain gap and noisy labels severely hinder the clustering and training processes in video-based person Re-ID. To address this problem, we propose a novel UDA method based on Dynamic Clustering and Co-segment Attentive Learning (DCCAL). DCCAL comprises a Dynamic Clustering (DC) module and a Co-segment Attentive Learning (CAL) module: the former alleviates noisy labels by clustering pedestrians adaptively across different generation processes, while the latter reduces the domain gap with a co-segmentation-based attention mechanism. Additionally, we introduce a Kullback-Leibler (KL) divergence loss to reduce the discrepancy between the feature distributions of the two domains for better performance. Experimental results on two large-scale video-based person Re-ID datasets, MARS and DukeMTMC-VideoReID (DukeV), demonstrate the effectiveness of our approach. Our method outperforms state-of-the-art semi-supervised and unsupervised approaches by 1.1% in Rank-1 and 1.5% in mAP on DukeV, and by 3.1% in Rank-1 and 2.1% in mAP on MARS.
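To make the two core ideas in the abstract concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of (i) generating pseudo-labels by clustering target-domain features and (ii) a KL-divergence term that pulls the source and target feature distributions together. The choice of DBSCAN, the eps/min_samples values, the batch-mean formulation of the KL term, and all function names are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import DBSCAN


def generate_pseudo_labels(target_features: torch.Tensor,
                           eps: float = 0.6,
                           min_samples: int = 4) -> torch.Tensor:
    """Cluster L2-normalized target-domain features to obtain pseudo-labels.

    Samples labeled -1 are treated as noise and would typically be
    excluded from fine-tuning. DBSCAN and its parameters are an
    illustrative choice, not the paper's specific clustering scheme.
    """
    feats = F.normalize(target_features, dim=1).cpu().numpy()
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="euclidean").fit_predict(feats)
    return torch.as_tensor(labels)


def kl_alignment_loss(source_feats: torch.Tensor,
                      target_feats: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between softmax-normalized mean features of the two domains.

    This is one simple way to penalize the distribution gap; the exact
    form of the KL loss used in DCCAL may differ.
    """
    p_log = F.log_softmax(source_feats.mean(dim=0) / temperature, dim=0)
    q = F.softmax(target_feats.mean(dim=0) / temperature, dim=0)
    return F.kl_div(p_log, q, reduction="sum")
```

In a typical UDA pipeline, the pseudo-labels would drive a standard identification loss on the target data, while the alignment term is added to the total loss for each mini-batch drawn jointly from the source and target domains.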
INDEX TERMS Person re-identification, Unsupervised domain adaptation, Dynamic clustering, Co-segment attentive learning.
I. INTRODUCTION
Person Re-ID aims at retrieving a pedestrian across different non-overlapping cameras or from the same camera at different times. It primarily consists of image- and video-based methods, which exploit spatial and spatiotemporal clues to represent a person in images or video sequences, respectively. Existing approaches predominantly rely on supervised learning with labeled datasets. However, labeling is costly and impractical in realistic environments, and trained models often struggle to adapt to the target domain. The main reason is that pedestrians are easily affected by many factors such as illumination, viewpoint, background noise, occlusion, resolution, appearance, and posture. These factors result in large intra-class differences within the same dataset and a large gap in sample distribution between two different domains, especially in video-based person Re-ID. To address this common and realistic difficulty, semi-supervised, unsupervised, and UDA methods for person Re-ID have been studied in many works. Semi-supervised methods use limited labeled samples in a d...