“…From these observations, existing speech-based ADD systems focus on different types of frame-level acoustic-prosodic features, such as pitch, energy, phoneme duration [17], zero crossing rate, spectral centroid, spectral bandwidth, spectral rolloff, chroma frequencies, Mel-frequency cepstral coefficients [18] and their derivatives [12,19,20], glottal flow, the percentage of voiced/unvoiced frames, and harmonic model and phase distortion [21], among others. For the classification module, most of these works adopt traditional machine learning methods that are fed with compact representations (usually, statistical functionals) of these hand-crafted acoustic characteristics.…”
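
To make the idea of "statistical functionals" concrete, the following is a minimal sketch, not the pipeline of any cited work. It computes two of the frame-level features named above (short-time energy and zero crossing rate) and collapses each contour into its mean, standard deviation, minimum and maximum, giving a fixed-length vector that a traditional classifier could take as input. The function names, the 400-sample frame, the 160-sample hop and the synthetic sine tone are all illustrative assumptions.

```python
import math
import statistics


def frame_signal(samples, frame_len, hop):
    """Split a sequence of samples into overlapping frames (no padding)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def functionals(values):
    """Collapse a frame-level contour into a few statistical functionals."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def utterance_vector(samples, frame_len=400, hop=160):
    """Fixed-length feature vector for one utterance, whatever its duration."""
    frames = frame_signal(samples, frame_len, hop)
    contours = {
        "energy": [short_time_energy(f) for f in frames],
        "zcr": [zero_crossing_rate(f) for f in frames],
    }
    vector = []
    for name in sorted(contours):  # fixed order: energy first, then zcr
        stats = functionals(contours[name])
        vector.extend(stats[k] for k in ("mean", "std", "min", "max"))
    return vector


# Synthetic 1-second, 16 kHz sine tone standing in for a speech recording
# (an assumption for illustration; real systems would load recorded audio).
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
features = utterance_vector(tone)
```

The vector has the same length for any recording duration, which is what allows utterances of different lengths to be fed to a conventional classifier. In practice, features such as MFCCs or spectral rolloff would typically come from a dedicated audio library rather than hand-written code.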