“…Another interesting task is to localize objects that sound [64,4,54,65,67,11], where the goal is to pinpoint audio sources from the visual data. Other works study audio-visual action recognition [35,38,26,58], audio-visual navigation [22,10,9], talking head synthesis [56], spatial audio from video [43,24,62,42], and visual-to-auditory [33,20].…”