“…Classifiers relying on these human-defined features have often reported promising performance in discriminating diagnosed individuals (e.g., with schizophrenia) from controls without any neuropsychiatric condition, with accuracies between 60% and 90% (see Parola, Simonsen, Bliksted, and Fusaroli [3], Koops, Brederoo, Boer, Nadema, Voppel, and Sommer [5], Fusaroli, Lambrechts, Bang, Bowler, and Gaigg [6], and Rybner, Jessen, Mortensen, et al. [11] for an overview of existing studies). Furthermore, several studies have sought to combine the two modalities (speech, text) and found them to contain complementary information for the classification of neuropsychiatric conditions [1,21]. This complementarity may be due to the ability of text models to capture variations in word usage that are not captured by acoustic models, and the ability of acoustic models to identify variations in prosody (e.g.…”