For the DSD task, previous research was carried out using traditional algorithmically extracted speech features or statistical descriptors derived from them, including the mean and standard deviation of the pitch [23], the MFCCs and their delta and double-delta coefficients [24,25], jitter, the harmonic-to-noise ratio (HNR) [26], or other acoustic and prosodic features [25,27] based on the ComParE feature set [28]. As for the machine learning models used, these include support vector machines (SVMs) [16,23,25,26], random forests (RFs) [20,23,25,27,29,30], MLPs [23,25,27,31], logistic regression [25,31], or ensemble methods with multiple classifiers and average/majority voting [27]. Specific results previously reported in the literature for the databases used in this work for the DSD task, RLDD and RODeCAR, are provided in Section 3.5.
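To make the feature pipeline concrete, the sketch below shows one common way such statistical descriptors are built: frame-level coefficients (e.g., MFCCs) are extended with delta and double-delta coefficients, then summarized with the mean and standard deviation over the utterance to obtain a fixed-length vector that can be fed to an SVM or RF. This is a minimal illustration assuming precomputed MFCC frames; the delta computation here is a simple first-order difference, whereas toolkits often use a regression-based formula, and the function names are hypothetical.

```python
import numpy as np

def delta(features):
    """First-order difference approximation of delta coefficients.

    features: array of shape (n_frames, n_coeffs).
    The first frame's delta is zero because it is differenced with itself.
    """
    return np.diff(features, axis=0, prepend=features[:1])

def utterance_descriptor(mfcc):
    """Fixed-length statistical descriptor of a variable-length utterance.

    mfcc: (n_frames, n_coeffs) frame-level coefficients.
    Returns the per-coefficient mean and standard deviation of the
    static, delta, and double-delta streams, concatenated.
    """
    d1 = delta(mfcc)
    d2 = delta(d1)
    stacked = np.hstack([mfcc, d1, d2])          # (n_frames, 3 * n_coeffs)
    return np.concatenate([stacked.mean(axis=0), stacked.std(axis=0)])

# Hypothetical example: 100 frames of 13 MFCCs -> 2 * 3 * 13 = 78 values.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))
vec = utterance_descriptor(mfcc)
print(vec.shape)  # (78,)
```

The resulting fixed-length vector is the kind of input the cited works pass to frame-independent classifiers such as SVMs, RFs, or logistic regression.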