“…Several studies have been conducted consecutively on learning multimodal structural knowledge, commencing with visual concept recognition as a basic task (Deng et al, 2009;Davoodi et al, 2023;Kamath et al, 2022), progressing to object grounding (Yu and Ballard, 2004;Shao et al, 2019;Krasin et al, 2017), object attribute detection (Chen et al, 2023a;Patil and Abhyankar, 2023;Bravo et al, 2022), object-object relation detection (Krishna et al, 2016), and finally visual event understanding (Yatskar et al, 2016(Yatskar et al, , 2017Cho et al, 2022;…”