“…With the release of AVA-ActiveSpeaker [32], the first large-scale active speaker detection dataset, researchers have made significant progress in this field [15,36,37,39,46], following the rapid development of deep learning for audio-visual tasks [21]. These studies improve active speaker detection performance by inputting the face sequences of multiple candidates simultaneously [1,2,46], extracting visual features with 3D convolutional neural networks [3,18,47], modeling cross-modal information with complex attention modules [9,43,44], etc., all of which raise memory and computation requirements. Consequently, existing works are difficult to apply in scenarios that require real-time processing with limited memory and computational resources, such as automatic video editing and live television.…”
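The higher cost attributed to 3D convolutions above can be illustrated with a quick parameter count: a k×k×k spatiotemporal kernel carries roughly k times the weights of a k×k spatial one at the same channel widths. A minimal sketch (the channel and kernel sizes are illustrative assumptions, not taken from any cited model):

```python
from math import prod

def conv_params(in_ch, out_ch, kernel):
    """Learnable parameters of a conv layer: weights plus one bias per output channel."""
    return out_ch * (in_ch * prod(kernel) + 1)

# Hypothetical layer: 64 input channels, 128 output channels.
p2d = conv_params(64, 128, (3, 3))       # 2D spatial convolution
p3d = conv_params(64, 128, (3, 3, 3))    # 3D spatiotemporal convolution

print(p2d, p3d)  # the 3D layer has roughly 3x the parameters (and FLOPs scale similarly per output position)
```

The same multiplicative effect applies to activations and FLOPs, which is why stacking 3D convolutions or processing several candidate face tracks at once quickly strains real-time, memory-limited deployments.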