“…Existing methods for video QA conduct direct answering selection based on the multimodal encoding of questions and videos (Jang et al, 2017;Lei et al, 2018Lei et al, , 2020. In recent years, researchers have proposed many optimization strategies for better performance in video question answering, e.g., designing delicate encoding mechanisms (Kim et al, 2020a;Nuamah, 2021;Gao et al, 2018;Li et al, 2019;Fan et al, 2019;Le et al, 2020;Jiang et al, 2020;Kim et al, 2020b;Seo et al, 2021) graphs , adopting video pre-trained language models (Li et al, 2020;Zellers et al, 2021;Li and Wang, 2020;Lei et al, 2021;Sun et al, 2019), and leveraging external knowledge or resources (Chadha et al, 2020;Liu et al, 2020b;Song et al, 2021;. Compared with conventional monomodal question answering tasks such as text QA (Oguz et al, 2021;Zhou et al, 2018; and table QA .…”