“…Most of the attempts have replaced or improved one of the components of an i-vector + PLDA system (feature extraction, calculation of sufficient statistics, i-vector extraction or PLDA) with a neural network. As examples, let us mention: using NN bottleneck features instead of conventional MFCC features [1], NN acoustic models replacing Gaussian Mixture Models for extraction of sufficient statistics [2], NNs for either complementing PLDA [3,4] or replacing it [5]. More ambitiously, NNs that take the frame-level features of an utterance as input and directly produce an utterance-level representation (usually referred to as an embedding) have in the past two years almost replaced the generative i-vector approach in text-independent speaker recognition [6,7,8,9,10,11,12].…”