The viewpoint variability across a network of non-overlapping cameras is a challenging problem affecting person re-identification performance. In this paper, we investigate how to mitigate cross-view ambiguity by learning highly discriminative deep features under the supervision of a novel loss function. The proposed objective comprises two terms, the steering meta center term and the enhancing centers dispersion term, which steer the training process toward mining effective intra-class and inter-class relationships in the feature domain of the identities. The effect of our loss supervision is a more expanded feature space of compact classes, in which the overall level of inter-identity interference is reduced. Compared with existing metric learning techniques, this approach achieves a better optimization because it learns the embedding and the metric jointly. By dismissing side sources of performance gain, our technique proves to enhance the CNN's invariance to viewpoint without incurring the increased training complexity of Siamese or triplet networks, and it outperforms many related state-of-the-art techniques on Market-1501 and CUHK03.
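The following is a minimal sketch of the kind of two-term, center-based loss the abstract describes: an intra-class term that pulls each feature toward its identity's learnable center, and an inter-class term that pushes distinct centers apart. The exact formulation of the paper's steering meta center and enhancing centers dispersion terms is not given in the abstract, so the class name, the hinge-style dispersion penalty, and the `margin` parameter here are illustrative assumptions (PyTorch assumed).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterDispersionLoss(nn.Module):
    """Hypothetical two-term loss: center pulling + center dispersion."""

    def __init__(self, num_classes: int, feat_dim: int, margin: float = 1.0):
        super().__init__()
        # One learnable center per identity class.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.margin = margin

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Intra-class term: squared distance of each feature to its own class center,
        # encouraging compact classes.
        own_centers = self.centers[labels]                       # (B, D)
        intra = (features - own_centers).pow(2).sum(dim=1).mean()

        # Inter-class term: hinge penalty whenever two distinct centers lie closer
        # than `margin`, expanding the feature space between identities.
        dists = torch.cdist(self.centers, self.centers, p=2)     # (C, C)
        c = dists.size(0)
        off_diag = dists[~torch.eye(c, dtype=torch.bool, device=dists.device)]
        inter = F.relu(self.margin - off_diag).mean()

        return intra + inter
```

Consistent with the abstract's claim of no added training complexity, a loss of this shape supervises a single classification-style CNN directly on its feature embeddings, rather than requiring the paired or triplet inputs of Siamese and triplet networks.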
Video-based person re-identification deals with the inherent difficulty of matching unregulated sequences of different lengths and with incomplete target pose/viewpoint structure. Common approaches either reduce the problem to the still-image case, incurring a significant information loss, or exploit inter-sequence temporal dependencies, as in Siamese recurrent neural networks or in gait analysis. In all cases, however, the inter-sequence pose/viewpoint misalignment is not considered, and existing spatial approaches are mostly limited to the still-image context. To this end, we propose a novel approach that exploits the rich video information more effectively by accounting for the role that the changing pose/viewpoint factor plays in the sequence matching process. Specifically, our approach consists of two components. The first, Weighted Fusion (WF), complements the original pose-incomplete information carried by the sequences with synthetic GAN-generated images and fuses their feature vectors into a more discriminative, viewpoint-insensitive embedding. The second, Weighted-Pose Regulation (WPR), performs an explicit pose-based alignment of sequence pairs to promote coherent feature matching. Extensive experiments on two large video-based benchmark datasets show that our approach considerably outperforms existing methods.
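Below is a minimal sketch of the Weighted Fusion idea as the abstract describes it: feature vectors extracted from both real frames and synthetic (GAN-generated) views are combined into a single embedding via normalized per-view weights. The weighting scheme shown (a learned scalar score per feature vector, normalized with a softmax) is an assumption for illustration; the paper's actual WF component may compute its weights differently (PyTorch assumed).

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Hypothetical WF head: fuse per-view features into one embedding."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Assumed scoring head: one scalar importance score per view feature.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, view_features: torch.Tensor) -> torch.Tensor:
        # view_features: (num_views, feat_dim), mixing real frames and
        # GAN-generated views that fill in missing poses/viewpoints.
        weights = torch.softmax(self.score(view_features), dim=0)  # (num_views, 1)
        # Weighted sum yields a single viewpoint-insensitive embedding.
        return (weights * view_features).sum(dim=0)                # (feat_dim,)
```

Two such fused embeddings, one per sequence, can then be compared directly, with WPR's pose-based alignment deciding which frames of each pair are matched before feature extraction.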