“…Cross-modal image generation has seen a lot of attention in recent years with work on text-to-image models [12,[32][33][34]40], speech-toimage models [5,20], and image-to-video models [42]. Others have focused on using more direct representations to control generation, including generation from semantic masks [18,26,52], or worked to convert higher-level representations such as text [15,30], scene graphs [7,19,44,51], and bounding boxes [22] into such a semantic mask. There have also been several attempts at learning spatially disentangled scene representations by composing latent features that correspond to specific parts of a scene [10,27].…”