“…VGS usually leverages image-speech [25,26] or video-speech [27,28] paired data. In practice, besides speech-image retrieval and alignment [29,30,31,32,33,34], VGS models have also been shown to achieve competitive performance in keyword spotting [35], query-by-example search [36], and various tasks in the SUPERB benchmark [37,38]. The study of linguistic information learned in VGS models has been attracting increasing attention.…”