“…The data distribution is modeled either explicitly (e.g., variational autoencoders [17]) or implicitly (e.g., generative adversarial networks [5]). Building on unconditional generative models, conditional generative models synthesize images according to additional context such as images [4,19,25,33,45], segmentation masks [11,26,37,47], and text. Text conditions are typically expressed in one of two formats: natural language sentences [40,42] or scene graphs [14].…”