“…The final goal is to detect all objects in the scene, recognize their class, reconstruct their 3D shape, as well as their pose within the overall scene coordinate frame. With the advances of scalable deep learning techniques, the field has progressed from reconstructing the 3D shape of one object in a simple image with trivial background [32,50,40,33,8,17], to limited reasoning about object arrangements in simple multi-object scenes [36,18,23], and finally to unrestricted multi-object 3D reconstruction in complex real-world scenes [49,30,42,12]. This evolution has been dependent on the availability of ever larger and more diverse data sets for training and evaluation [3,10,6,44,47,7,27,15,16] Existing datasets for Semantic 3D scene understanding fall broadly in two categories: synthetic and acquired from real images/videos.…”