“…Generating photographic images from text descriptions (known as Text-to-Image Generation, T2I) is a challenging cross-modal generation task and a core component of many computer vision applications, such as Image Editing [28], [51], Story Visualization [53], and Multimedia Retrieval [19]. Unlike image generation [26], [17], [22] and image processing [6], [5], [23] tasks, which operate within a single modality, T2I must build a heterogeneous semantic bridge between text and image, which is difficult [54], [48], [40]. Many state-of-the-art T2I algorithms [31], [25], [36], [9], [3], [42] first extract text features and then use Generative Adversarial Networks (GANs) [7] to generate the corresponding image.…”