For our classifier, we rely on prosodic and acoustic features, in line with the salient features reported in related work [6,7]. Building on previous work on emotion recognition [12], we extract audio descriptors such as 16 mel-frequency cepstral coefficients (MFCCs), 5 formant frequencies, intensity, pitch, perceptual loudness [13], zero-crossing rate, harmonics-to-noise ratio, the spectral center of gravity (centroid), the 95% roll-off point of the spectral energy, and the spectral flux, using a 10 ms frame shift. From these descriptors, we derive statistics at the utterance level, computed separately for voiced and unvoiced regions, over speech parts only.
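
To make this extraction step concrete, the following is a minimal sketch (not the authors' exact pipeline) of frame-level descriptor extraction with a 10 ms shift and utterance-level voiced/unvoiced statistics, using the librosa library. The file name utterance.wav, the pitch search range, and the choice of mean and standard deviation as utterance-level statistics are illustrative assumptions; formants, perceptual loudness, and the harmonics-to-noise ratio are omitted here, as they typically require a phonetics toolkit such as Praat.

    # Minimal sketch of utterance-level feature extraction, assuming a
    # hypothetical input file "utterance.wav". Formants, loudness, and
    # HNR are omitted (they usually come from Praat-style tools).
    import numpy as np
    import librosa

    def utterance_features(path, sr=16000):
        y, sr = librosa.load(path, sr=sr)
        hop = int(0.010 * sr)  # 10 ms frame shift

        # Frame-level descriptors, each of shape (n_features, n_frames)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16, hop_length=hop)
        zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)
        rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop,
                                                   roll_percent=0.95)

        # Spectral flux: frame-to-frame change of the magnitude spectrum
        S = np.abs(librosa.stft(y, hop_length=hop))
        flux = np.sqrt(np.sum(np.diff(S, axis=1) ** 2, axis=0))
        flux = np.concatenate([[0.0], flux])[np.newaxis, :]

        # Pitch and a per-frame voiced/unvoiced decision via pYIN
        f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                                     hop_length=hop)

        # Align frame counts (feature extractors may differ by one frame)
        n = min(mfcc.shape[1], zcr.shape[1], flux.shape[1], len(voiced))
        frames = np.vstack([mfcc[:, :n], zcr[:, :n], centroid[:, :n],
                            rolloff[:, :n], flux[:, :n]])
        voiced = voiced[:n].astype(bool)

        # Utterance-level statistics, separately for voiced and unvoiced
        stats = []
        for mask in (voiced, ~voiced):
            sel = frames[:, mask]
            stats.append(np.concatenate([sel.mean(axis=1), sel.std(axis=1)])
                         if sel.size else np.zeros(2 * frames.shape[0]))

        # Pitch statistics over voiced frames only (f0 is NaN when unvoiced)
        f0v = f0[:n][voiced]
        pitch = (np.array([np.nanmean(f0v), np.nanstd(f0v)])
                 if f0v.size else np.zeros(2))
        return np.concatenate(stats + [pitch])

Restricting the statistics to speech parts, as in the paper, would additionally require a voice activity detector to discard non-speech frames before pooling; the voiced/unvoiced split above only distinguishes periodic from aperiodic frames within the signal.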