“…[35,36] tailor better recurrent layers that are easy to stack deep for higher-dimensional video information. These recurrent-based methods have advantages over convolutional ones for tasks sensitive to sequence order, such as video future prediction [32,56,67], trajectory prediction [47], and video description [4,55]. While for tasks that more focus on integrated features like action recognition [65,25,37,2,6,24,23], there is still a gap between the recurrent and convolutional models.…”