“…The value of leveraging simulated environments to augment training has been explored in various vision tasks, such as object detection, semantic segmentation, and pose estimation [28,32,37,48,57,66]. Synthetic environments have also been applied to vision-and-language problems such as embodied agent learning [11,12,16,30,51,55], using platforms such as the Unreal Engine [36,40] as well as existing scenes and spaces manually created by specialized designers and content creators [60]. Within VQA, synthetic datasets such as CLEVR [26] and CLEVRER [64] have been proposed to train models and diagnose their performance on compositional questions.…”