“…Alternatively, many works learn to predict temporal transformations such as clip order [19,40,51,79], speed [5,8,82] and their combinations [32,48]. These self-supervised temporal representations are effective for classifying and retrieving coarse-grained actions but are challenged by downstream settings with subtle motions [62,70]. Other works utilize the multimodal nature of videos [1,2,20,23,49,52,57] and learn similarity with audio [1,2,52] and optical flow [20,23,54,77].…”