“…1). This is in contrast to existing methods, which typically either extract global representations of the entire image [6,45,46,7] or video sequence [38,16], and thus do not focus on the action itself, or localize feature extraction to the action via dense trajectories [43,42,9], optical flow [8,45,20], or actionness [44,3,14,48,39,24], and thus fail to exploit contextual information. To the best of our knowledge, only two-stream networks [30,8,4,22] have attempted to jointly leverage both types of information, by using RGB frames in conjunction with optical flow to localize the action.…”
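For orientation only, the sketch below illustrates the generic two-stream idea referred to above: an appearance stream operating on RGB frames and a motion stream operating on stacked optical-flow fields, with their class scores fused at the end. It is a simplified illustration under assumed placeholder names (`StreamCNN`, `TwoStreamNet`, the layer sizes, and the flow-stack length), not the specific architectures of [30,8,4,22].

```python
# Minimal sketch of a generic two-stream network (hypothetical names and layer
# sizes; not the exact architectures of [30,8,4,22]).
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """One convolutional stream; in_channels differs per modality:
    3 for an RGB frame, 2*L for a stack of L horizontal/vertical flow fields."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.classifier(f)

class TwoStreamNet(nn.Module):
    """Appearance (RGB) and motion (optical flow) streams with late score fusion."""
    def __init__(self, num_classes, flow_stack=10):
        super().__init__()
        self.rgb_stream = StreamCNN(3, num_classes)
        self.flow_stream = StreamCNN(2 * flow_stack, num_classes)

    def forward(self, rgb, flow):
        # Average the per-stream class scores (late fusion).
        return 0.5 * (self.rgb_stream(rgb) + self.flow_stream(flow))

if __name__ == "__main__":
    net = TwoStreamNet(num_classes=21)
    rgb = torch.randn(4, 3, 224, 224)    # batch of RGB frames
    flow = torch.randn(4, 20, 224, 224)  # batch of 10 stacked x/y flow fields
    print(net(rgb, flow).shape)          # -> torch.Size([4, 21])
```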