“…Most investigated feature sets are based on prosodic features such as fundamental frequency (F0) and power [9,10,11,12,13]. Linguistic features, such as syntactic structure, turn-ending markers, and language models, have also been investigated [14,15]. In addition, multimodal features have been considered, including eye gaze [16,17,18,19], respiration [20,21,22], and head direction [16,23].…”