“…Recently, the Zero Resource Speech Challenge [6] was organized to compare the performance of these methods. Typical approaches include neural-network methods, such as representation learning with autoencoders [7], [8], [9] or discriminative training with ABnet [10]; traditional clustering, such as GMM [11] or k-means [12], [11]; and nonparametric clustering, such as the Dirichlet Process Gaussian Mixture Model (DPGMM) trained by Gibbs sampling [1] or variational inference [13], [14]. Among them, DPGMM, an acoustic clustering method, achieved the top performance at Zerospeech 2015 and 2017 [15], [16].…”