“…To address this and allow the learning of improved frame-level acoustic features, we build on recent work in "zero-resource" speech processing, where the goal is to learn robust feature representations without access to any labelled speech data (Versteegh et al., 2016; Dunbar et al., 2017, 2019). A variety of features and learning approaches have been considered, ranging from conventional speech features (Carlin et al., 2011; Vavrek et al., 2012; Lopez-Otero et al., 2016), to posteriorgrams from probabilistic mixture models (Zhang and Glass, 2009; Heck et al., 2017, 2018), to latent representations computed by neural networks (Badino et al., 2015; Renshaw et al., 2015; Zeghidour et al., 2016; Riad et al., 2018; Eloff et al., 2019). Among these, multilingual bottleneck feature (BNF) extractors, trained on well-resourced but out-of-domain languages, have been found by several authors to improve on the performance of MFCCs and other representations (Veselý et al., 2012; Vu et al., 2012; Thomas et al., 2012; Cui et al., 2015; Alumäe et al., 2016; Chen et al., 2017; Yuan et al., 2017; Hermann and Goldwater, 2018; Hermann et al., 2021).…”
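To make the BNF idea concrete, the sketch below shows one plausible way such an extractor could be set up: spliced frame-level features are passed through a network with a low-dimensional bottleneck layer, trained with per-language phone classification heads on well-resourced languages, and the bottleneck activations are then reused as features for the zero-resource language. The layer sizes, class names, and training setup here are illustrative assumptions (in PyTorch), not the specific architectures used in the works cited above.

```python
import torch
import torch.nn as nn

class MultilingualBNFExtractor(nn.Module):
    """Illustrative bottleneck-feature extractor (hypothetical dimensions).

    A stack of context frames (e.g. MFCCs with several frames of left/right
    context) is mapped through hidden layers to a low-dimensional bottleneck;
    separate softmax heads predict phone labels for each well-resourced
    training language. After multilingual training, the bottleneck
    activations serve as frame-level features for the unseen language.
    """

    def __init__(self, input_dim=39 * 11, bottleneck_dim=40,
                 hidden_dim=1024, phones_per_language=(46, 52, 38)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, bottleneck_dim),  # bottleneck layer
        )
        # One phone-classification head per training language (multitask).
        self.heads = nn.ModuleList(
            nn.Linear(bottleneck_dim, n) for n in phones_per_language
        )

    def forward(self, frames, language_id):
        bnf = self.encoder(frames)             # (batch, bottleneck_dim)
        logits = self.heads[language_id](bnf)  # phone logits for that language
        return bnf, logits


# Usage: after training on the well-resourced languages, discard the heads
# and keep the bottleneck outputs as features for the target language.
model = MultilingualBNFExtractor()
stacked_mfccs = torch.randn(8, 39 * 11)         # batch of spliced MFCC frames
features, _ = model(stacked_mfccs, language_id=0)
print(features.shape)                           # torch.Size([8, 40])
```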