“…Therefore, to overcome these shortcomings of visual trackers [11], re-identification metrics have recently been learned and integrated with visual trackers [12] to follow the target person [1, 2]. These re-identification metrics are learned by matching color histograms and gait features [1, 13], as well as by extracting deep CNN features to learn deep similarity metrics [2, 3, 14, 15, 16]. However, the re-identification metrics in the present works [2, 3, 14, 15, 16] are all learned under a naïve-world assumption, i.e., it is assumed that the outside world is closed-set and unimodal (the robot uses only an RGB sensor), and that the appearance of the target person P1 remains unchanged when moving across different domains.…”
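As a minimal sketch of the color-histogram matching mentioned above (an illustration only, not the exact method of [1, 13]; the function names, bin count, and the histogram-intersection similarity are our own illustrative choices), one can compare per-channel RGB histograms of two person crops:

```python
import numpy as np

def color_histogram(img, bins=8):
    """Per-channel RGB histograms of an HxWx3 uint8 crop, concatenated and L1-normalized."""
    hist = np.concatenate([
        np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]).astype(float)
    return hist / hist.sum()

def hist_intersection(h1, h2):
    """Histogram-intersection similarity in [0, 1]; 1 means identical color distributions."""
    return float(np.minimum(h1, h2).sum())

rng = np.random.default_rng(0)
gallery = rng.integers(0, 256, (64, 32, 3), dtype=np.uint8)  # stored crop of the target
same    = gallery.copy()                                     # same appearance re-observed
other   = rng.integers(0, 256, (64, 32, 3), dtype=np.uint8)  # a different person

sim_same = hist_intersection(color_histogram(gallery), color_histogram(same))
sim_diff = hist_intersection(color_histogram(gallery), color_histogram(other))
# An identical crop scores the maximum similarity of 1.0, so the tracker
# would re-identify it as the target; a different crop scores lower.
```

Such a hand-crafted metric is exactly what fails under the closed-set, single-modality, appearance-constancy assumptions criticized above: the target's color distribution changes with clothing and lighting across domains.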