“…and video-text [65,64,87,26,54,1,5], and video-audio [42,53,29] representation learning. While the use of transformer architectures for video is still in its infancy, concurrent works [7,2,51,22] have already demonstrated that this is a highly promising direction. However, unlike our approach, these methods treat time as just another dimension and have no mechanism for reasoning about motion paths.…”