“…While modeling time remains a challenge, it also presents a natural source of supervision that has been exploited for self-supervised learning. For example, time has been used as a proxy signal by posing pretext tasks involving spatio-temporal jigsaw [1,43,52], video speed [10,16,47,94,109,123], arrow of time [78,80,112], frame/clip ordering [24,70,90,97,116], video continuity [60], or tracking [44,106,111]. Several works have also used contrastive learning to obtain spatio-temporal representations by (i) contrasting temporally augmented versions of a clip [46,77,81], or (ii) encouraging consistency between local and global temporal contexts [9,17,85,122].…”
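
As a rough illustration of how such pretext tasks derive labels from time alone, the sketch below builds training samples for two of the tasks named in the passage, arrow of time and video speed. It is a minimal sketch and not the method of any cited work: the function names `arrow_of_time_sample` and `speed_sample`, the speed set `(1, 2, 4)`, and the clip length are assumptions chosen for illustration, and frames are represented by plain integer indices rather than images.

```python
import random


def arrow_of_time_sample(frames, rng):
    """Return (clip, label); label is 1 for forward playback, 0 for reversed.

    Hypothetical helper: the label comes for free from the frame order.
    """
    label = rng.randint(0, 1)
    clip = list(frames) if label == 1 else list(reversed(frames))
    return clip, label


def speed_sample(frames, rng, speeds=(1, 2, 4), clip_len=4):
    """Return (clip, label); label indexes the playback speed in `speeds`.

    Hypothetical helper: a clip at speed s keeps every s-th frame.
    """
    label = rng.randrange(len(speeds))
    step = speeds[label]
    # Number of source frames covered by a clip of clip_len frames at this step.
    span = (clip_len - 1) * step + 1
    if span > len(frames):
        raise ValueError("video too short for the requested speed and clip length")
    start = rng.randrange(len(frames) - span + 1)
    clip = [frames[start + i * step] for i in range(clip_len)]
    return clip, label


if __name__ == "__main__":
    rng = random.Random(0)
    video = list(range(16))  # stand-in for 16 decoded frames
    print(arrow_of_time_sample(video, rng))
    print(speed_sample(video, rng))
```

In both cases a classifier trained to predict `label` from `clip` needs no human annotation, which is the sense in which time acts as a proxy signal.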