“…Instead, we designed a self-supervised training approach based on a triplet loss architecture that only requires raw endoscopic video frames. Indeed, triplet loss, introduced in(Schroff et al, 2015) as FaceNet model, has been successfully used in several tasks(Grati et al, 2020;Harvill et al, 2019;Kumar et al, 2021). As depicted…”