“…However, these approaches still used handcrafted features. With the advent of deep learning, learning representations from data has been extensively studied [14,15,45,58,53,54,25,7,62,56,41,3]. Of these, one of the most popular frameworks has been the approach of Simonyan et al [45], who introduced the idea of training separate color and optical flow networks to capture local properties of the video.…”