“…As a fundamental component of video understanding, learning spatiotemporal representations remains an active research area in recent years. Since the beginning of the deep learning era, numerous architectures have been proposed to learn spatiotemporal semantics, such as traditional two-stream networks [40,46,59], 3D convolutional neural networks [42,5,20,35,44,50,48,13,12], and spatiotemporal Transformers [3,33,32,10,1,28,52]. As videos are high-dimensional and exhibit substantial spatiotemporal redundancy, training video recognition models from scratch is highly inefficient and may lead to inferior performance.…”