“…They rely on the transcript content to extract features such as bag-of-words, POS tags, or word embeddings [7,12,16,18,24,27,31]. Mixtures of acoustic and lexical features have also been explored [1,13,14,33], which is advantageous when both the audio signal and the transcript are available.…”