“…• Video-based sound source localization (SSL) [5], [8], [20], [60], [65], [66], [85], [94], [95], [97], [103], [110], [131], [138], [139], [148], [151], [153], [162]-[166] involves marking the pixels in video frames that correspond to each sound source, such as a vehicle. When the sound source is a person, the task becomes audiovisual speaker localization (AVSL) [23], [35], which involves identifying and locating the speaker(s) in an audiovisual scene, for example finding the person speaking in a video, and tracking that speaker [21], [22], [33].…”