“…Recent works have strongly focused on developing methods to recognize one specific level of granularity from video data. The visual detection of phases [7,15,16,25,30], robotic gestures [2,10,26,29], and instruments [11,14,16,22] has, for instance, seen a surge in research activities, due to their potential impact on developing intra- and postoperative tools for the purposes of monitoring safety, assessing skills, and reporting. Many of these previous works have focused on endoscopic cholecystectomy procedures, utilizing the publicly available large-scale Cholec80 dataset [25], and on cataract surgical procedures, utilizing the popular CATARACTS dataset [11,30].…”