“…On the other hand, several recent works incorporate the action (i.e., control command) of the agents in the latent representation [1, 13, 17-19, 29, 41, 42, 46]. While one can use the convolutional recurrent neural network to embed the entire past observations and actions to yield rich temporal information [1], most works first extract the low-dimensional latent dynamics model from the observation with actionconditioned SSM, and integrate the learned latent model into the agent's policy [29] or vision-based planning [19]. However, these approaches use a simple variational autoencoder to extract the latent state and thus cannot represent entity-wise interaction.…”