“…Existing research efforts [45,36,12,10,13,7,11,41,3,52,33,47,8,1,16,29,35,42,44,56,5,20] mainly focus on building effective and efficient video modeling networks. Generally speaking, they can be divided into two directions, namely, (1) two-stage solutions [8,1,16,29,35] which extract spatial feature vectors from video frames and then integrate the obtained local feature sequence into one compact video descriptor for recognition; (2) the 2D [45,36,10,7] or 3D convolution based [41,11,3,52,33,47,42,44,56,5] end-to-end video classification methods. Though great progress has been achieved by these methods, limited attention is paid to the aforementioned variation of frame-level salience among different frames.…”