“…Models based on graph neural networks (GNNs), including the graph convolutional network (GCN) (Kipf and Welling, 2017) and the graph attention network (GAT) (Velickovic et al, 2018), have achieved promising performance in many recent research studies, such as visual representation learning (Wu et al, 2019; Xie et al, 2021), text representation learning (Yao et al, 2019; Lou et al, 2021; Liang et al, 2021b, 2022), and recommendation systems (Ying et al, 2018; Tan et al, 2020). Further, some studies have explored graph models for multi-modal tasks, such as multi-modal sentiment detection (Yang et al, 2021), multi-modal named entity recognition, cross-modal video moment retrieval (Zeng et al, 2021), multi-modal neural machine translation (Yin et al, 2020), and multi-modal sarcasm detection (Liang et al, 2021a).…”