“…One fundamental task is to identify, at any given moment, who is speaking in a complex scene with multiple interacting people. This Active Speaker Detection (ASD) problem [3,4,10,19,26,30,32,36,38] is important for many real-world applications, such as humanrobot interaction [24,37,39], speech diarization [14-16, 41, 44], video re-targeting [5,33,45] etc.…”