“…Convolutional Neural Networks for video understanding have been extensively studied and, in the CNN era, widely applied to video-text pre-training [36], cross-modal analysis [29], video detection [19], e-commerce [6,7,8], adversarial attacks [4,53,54,55], interactive search [37,58], retrieval [17,18,26,57,62,64,66], hyperlinking [9,20,21,39], and captioning [2,47,61]. We select and review representative 3D-CNNs as follows. C3D [48] is a pioneering pure 3D-CNN built on a new 3D Conv operator, and it easily outperforms its 2D counterparts on video tasks.…”