“…Recently, there has been increased interest in TSE applications to speech [17], [38]-[41], music [15], [16], [18], [19], [42]-[46], and universal sounds [2], [11]-[14], [47], [48]. Various types of auxiliary clues have been proposed to identify the target in a sound mixture, including enrollment audio samples [12], [18], [19], [38], [39], [47], class labels [2], [11], [45], video signals of the target source [15], [42], [48], [49], and recently even onomatopoeia [14].…”