“…A straightforward deep learning solution to visual control problems is to learn action-conditioned video prediction models [38,14,8,53] and then perform Monte Carlo importance sampling and optimization, such as the cross-entropy method, over the available behaviors [15,12,29]. Hot topics in video prediction mainly include long-term and high-fidelity future frame generation [44,43,51,5,52,50,54,41,40,36,56,28,2], dynamics uncertainty modeling [1,10,48,31,7,16,55], object-centric scene decomposition [47,27,18,58,3], and space-time disentanglement [49,27,19,6]. The corresponding technical improvements mainly involve more effective neural architectures, novel probabilistic modeling methods, and specific forms of video representation.…”
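
The planning step mentioned in the passage, sampling candidate behaviors and refining them with the cross-entropy method, can be illustrated with a minimal sketch. It is not taken from any of the cited works. `toy_dynamics` is a hypothetical stand-in for a learned action-conditioned prediction model (here the state is a low-dimensional vector rather than video frames, and the action is simply added to it), and `cem_plan`, its parameters, and the distance-to-goal cost are assumptions chosen for illustration.

```python
import numpy as np


def toy_dynamics(states, actions):
    """Hypothetical stand-in for a learned action-conditioned predictor.

    Assumes the state and action have the same dimension and that an
    action simply displaces the state.
    """
    return states + actions


def cem_plan(state, goal, dynamics, horizon=5, action_dim=2,
             pop_size=200, n_elite=20, n_iters=10, seed=0):
    """Cross-entropy method over action sequences (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Diagonal Gaussian over action sequences, one mean/std per time step.
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))

    for _ in range(n_iters):
        # Sample candidate action sequences and keep them in [-1, 1].
        noise = rng.standard_normal((pop_size, horizon, action_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)

        # Roll every candidate through the model, starting from `state`.
        states = np.repeat(state[None, :], pop_size, axis=0)
        for t in range(horizon):
            states = dynamics(states, actions[:, t])

        # Assumed cost: Euclidean distance of the final state to the goal.
        costs = np.linalg.norm(states - goal, axis=1)

        # Refit the Gaussian to the lowest-cost (elite) candidates.
        elite = actions[np.argsort(costs)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6

    return mean  # planned action sequence, shape (horizon, action_dim)


if __name__ == "__main__":
    start = np.zeros(2)
    target = np.array([2.0, -1.5])
    plan = cem_plan(start, target, toy_dynamics)
    print("planned actions:\n", plan)
    print("predicted final state:", start + plan.sum(axis=0))
```

In a visual control setting the rollout would instead call the video prediction model on image observations and the cost would be computed on predicted frames or features; in a receding-horizon loop typically only the first planned action is executed before replanning.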