“…[8], and only the temporal model is optimized [1,13,22,25,30,42,54,60]. However, in specialized domains such as surgical video, well-pretrained CNNs may not be available, meaning that the CNN needs to be finetuned, either in a 2-stage [4,16,29,38,39,61] or in an end-to-end (E2E) training setting. The latter seems preferable since visual and temporal features can be learned jointly and in a more complete context, but BN layers in CNN backbones may cause problems.…”