“…These have also been applied for the analysis of surgical videos, such as instrument presence detection [11, 12], phase recognition [13, 14], tool location [15–17], and tool pose estimation [2, 18]. For example, a cascading model, which consists of a rough location network and a fine‐grained search network, was proposed by Mishra et al [19] to locate the tool tip. In the work of Chen et al [20], a CNN is trained with the datasets labelled by a line segment detector to detect a tool's tip, and then the spatial and temporal context algorithm [21] is utilised to detect the tool in real time.…”