“…In audiovisual speech recognition, the main goal is to improve speech recognition performance by combining visual information with audio/speech signals [13], [14]. Content analysis tasks such as automatic shot-boundary detection, multimedia event detection, and searching visual and multimodal content in a dataset are a few examples of multimedia content indexing and retrieval [15], [16], [17], [18]. Human-robot collaboration, human emotion recognition, human-computer interaction, and automatic assessment of depression and stress come under the category of understanding human behaviour from multimodal input data during social interactions [19], [20], [21], [22], [23].…”