“…The applications include image retrieval [46], image captioning [1,45], VQA [51,25] and image generation [24,19]. In order to generate high-quality scene graphs from images, a series of works explore different directions such as utilizing spatial context [61,65,40], graph structure [60,58,34], optimization [8], reinforcement learning [36,51], semi-supervised training [7] or a contrastive loss [66]. These works have achieved excellent results on image datasets [29,42,31].…”