Cross-modal learning has been explored to exploit the natural synchronisation between visual and audio signals [3,5,39]. Audio-visual data has been leveraged for audio-visual speech recognition [12,28,59,62], audio-visual event localization [51,52,55], sound source localization [4,29,45,49,51,60], self-supervised representation learning [25,31,35,37,39], generating sounds from video [10,19,38,64], and audio-visual source separation for speech [1,2,13,16,18,37], music [20,22,56,60,61], and objects [22,24,53]. In contrast to all these methods, we address a different task: producing binaural two-channel audio from a monaural audio clip using the visual stream of a video.