“…In an effort to incorporate the relational reasoning ability into the model, a scene graph representation (a structured description that captures semantic summaries of entities and their relationships) has been presented recently [7]. Since then, a number of works have proposed deep network-based approaches for generating the scene graphs, confirming its importance to the field [8,9,10,11,12,13,14]. While scene graph representation holds tremendous promise, extracting scene graphs from images is known to be challenging.…”