“…on the target datasets, has achieved great success in image recognition and video understanding [10,4,2,19]. This paper focuses on designing self-supervised learning methods for surgical video understanding, with surgical phase recognition as the downstream task; the goal of this task is to predict which phase is occurring at each frame of a video [1,29,13,14,15,6,9,25,24,28]. Self-supervised learning has been widely applied to various medical imaging modalities, such as X-ray [30], fundus images [16,17], CT [34], and MRI [32,30].…”