“…Previous works have proposed models to controllably generate e.g. images [13,17,38,45,48,51,55,57,73,76,77], videos [6,12,25,37,42,46,64,65,65,71], and audios [1,9,15,22,24,47,62,63], or separate sounds [18,19,79,80,84]. However, most of the audio works are music-related, and only a few attempts have been made to generate visually guided audio in an open domain setup [11,83].…”