“…On the other hand, implicit NLI systems view the text descriptions as another representation of the visual content, automatically converting the text to visual content and thus enabling users to create visual content implicitly. Extensive research in computer vision, computer graphics, and human-computer interaction has explored the automatic conversion of descriptive text into visual content, such as images [72,73], 3D shapes [9] and scenes [8,15], documents [11], and short video clips [35]. In recent years, with the development of generative adversarial networks, a plethora of systems [32,69,74,76,77] have been proposed to generate visual content based on text descriptions.…”