“…However, there are no hard-and-fast rules in this research area: researchers have tried other approaches and have evaluated their algorithms with varying metrics, for example the mean Average Precision (mAP) [1,24], Lc performance, which provides an evaluation from an energy perspective [25], F-Score [26], the ratio of correctly predicted samples to the total number of test samples [27], and the area under the Receiver Operating Characteristic curve (auROC) [1,10]. Some researchers have used only facial cues [3,4,9,28,29], others have used only audio cues, for example [5], and others have combined both [1,2]. Some researchers have used head movements, hand gestures and prosody in addition to facial cues [4,28]. Yet others, such as [3,30,31,32,33], rely on an array of multiple microphones and cameras to determine active speakers, because such a setup provides directional and spatial information respectively. Apart from the extra overhead, the problem with such methods is that they are not applicable in most real-life scenarios, such as YouTube videos.…”
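
As a rough illustration of three of the metrics named above, the following minimal sketch computes the ratio of correctly predicted samples, the F-Score, and the average precision for a binary speaking / not-speaking labelling. The function names, the 0/1 label encoding and the toy data are assumptions made for this example only; they are not taken from any of the cited works, whose exact evaluation protocols may differ. mAP would then be the mean of `average_precision` over classes or videos.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Ratio of correctly predicted samples to total number of samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f_score(y_true: Sequence[int], y_pred: Sequence[int], beta: float = 1.0) -> float:
    """F-beta score for the positive class (label 1); beta=1 gives F1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at each positive sample,
    with samples ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical toy data: 1 = speaking, 0 = not speaking.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3]

acc = accuracy(y_true, y_pred)
f1 = f_score(y_true, y_pred)
ap = average_precision(y_true, scores)
```

The first two metrics operate on hard predictions, whereas average precision (like auROC) needs per-sample confidence scores, which is one reason results reported with different metrics are hard to compare directly.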