“…Graphs are non-Euclidean structured data that can effectively represent relationships between nodes. Some recent works construct graphs over visual or linguistic elements in V+L tasks, such as VQA [16,27,43], VideoQA [28,30,78], Image Captioning [23,69,75], and Visual Grounding [31,47,68], to reveal relationships between these elements and obtain fine-grained semantic representations. These constructed graphs can be broadly grouped into three types: visual graphs between image objects/regions (e.g., [69]), linguistic graphs between sentence elements/tokens (e.g., [33]), and cross-modal graphs among visual and linguistic elements (e.g., [47]).…”