“…Grounding a scene graph with an image or image description can be beneficial for a variety of downstream tasks, such as image retrieval (Andrews et al, 2019; Johnson et al, 2015), image caption evaluation (Anderson et al, 2016) and image captioning (Zhong et al, 2020). Currently, there are three main research directions in scene graph parsing: those that parse images (Zellers et al, 2018; Tang et al, 2020; Xu et al, 2017; Zhang et al, 2019a; Cong et al, 2022; Li et al, 2022), text (Anderson et al, 2016; Schuster et al, 2015; Wang et al, 2018; Choi et al, 2022; Andrews et al, 2019; Sharifzadeh et al, 2022), or both modalities (Zhong et al, 2021; Sharifzadeh et al, 2022) into scene graphs. Parsing images involves using an object detection model to identify the location and class of each object, as well as classifiers to determine the relationships between objects and their attributes.…”
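
The image-parsing pipeline described in the excerpt (detect objects, then classify their attributes and pairwise relationships) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from any of the cited systems: the names `SceneObject`, `SceneGraph`, and `build_scene_graph` are hypothetical, and the detector and classifiers are represented by plain callables and hard-coded toy values rather than trained models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Bounding box as (x_min, y_min, x_max, y_max); the layout is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class SceneObject:
    """One detected object: class label, location, and attributes."""
    label: str
    box: Box
    attributes: List[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Objects are nodes; relations are (subject index, predicate, object index)."""
    objects: List[SceneObject]
    relations: List[Tuple[int, str, int]]


def build_scene_graph(
    detections: List[Tuple[str, Box]],
    attribute_fn: Callable[[str, Box], List[str]],
    relation_fn: Callable[[SceneObject, SceneObject], Optional[str]],
) -> SceneGraph:
    """Assemble a scene graph from detector output and two classifiers.

    `detections` stands in for an object detector's output, `attribute_fn`
    for an attribute classifier, and `relation_fn` for a relationship
    classifier that returns None when no relation is predicted.
    """
    objects = [
        SceneObject(label, box, attribute_fn(label, box))
        for label, box in detections
    ]
    relations = []
    # Query the relation classifier for every ordered pair of distinct objects.
    for i, subj in enumerate(objects):
        for j, obj in enumerate(objects):
            if i == j:
                continue
            predicate = relation_fn(subj, obj)
            if predicate is not None:
                relations.append((i, predicate, j))
    return SceneGraph(objects, relations)


# Toy stand-ins for trained models (hypothetical values, for illustration only).
toy_detections = [("man", (10.0, 5.0, 60.0, 80.0)), ("horse", (5.0, 40.0, 120.0, 150.0))]
toy_attributes = {"man": ["young"], "horse": ["brown"]}


def toy_attribute_fn(label: str, box: Box) -> List[str]:
    return list(toy_attributes.get(label, []))


def toy_relation_fn(subj: SceneObject, obj: SceneObject) -> Optional[str]:
    if subj.label == "man" and obj.label == "horse":
        return "riding"
    return None


graph = build_scene_graph(toy_detections, toy_attribute_fn, toy_relation_fn)
```

In this sketch the relation classifier is queried for every ordered pair of objects, which is quadratic in the number of detections; it is written that way only to keep the example short.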