“…Mapping instructions to actions has been studied extensively with intermediate symbolic representations (e.g., Chen and Mooney, 2011; Kim and Mooney, 2012; Artzi and Zettlemoyer, 2013; Artzi et al., 2014; Misra et al., 2015, 2016). Recently, there has been growing interest in direct mapping from raw visual observations to actions (Misra et al., 2017; Xiong et al., 2018; Anderson et al., 2018; Fried et al., 2018). We propose a model that enjoys the benefits of such direct mapping but explicitly decomposes the task into interpretable goal prediction and action generation.…”