“…VAD is a binary classification problem involving both feature extraction and classification. Various speech features can be found in the literature, such as energy, zero-crossing rate, harmonicity [13], perceptual spectral flux [14], Mel-frequency cepstral coefficients (MFCCs) [11], power-normalized cepstral coefficients (PNCCs) [15], entropy [16], Mel-filter bank (MFB) outputs [17], and a posteriori signal-to-noise ratio (SNR) weighted energy distance [3,18]. For the purpose of modelling and classification, popular techniques include Gaussian models [19], Gaussian mixture models (GMMs) [2,14,20], super-Gaussian models and their convex combination [9], i-vectors [15,21], decision trees [22], support vector machines [23], and neural network models (including deep models) [5,7,24].…”
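The two-stage structure described above (per-frame feature extraction followed by a binary speech/non-speech decision) can be illustrated with a minimal sketch. It uses only the two simplest features named in the passage, short-time energy and zero-crossing rate, and a fixed energy threshold as a stand-in for the statistical and neural classifiers cited. All function names, the 16 kHz sample rate, the 400-sample frame length, and the threshold value are assumptions made for illustration and do not come from any of the cited works.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len, advancing by hop (trailing partial frame dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def extract_features(samples, frame_len=400, hop=400):
    """Stage 1: one (energy, zero-crossing rate) pair per frame."""
    return [(short_time_energy(f), zero_crossing_rate(f))
            for f in frame_signal(samples, frame_len, hop)]

def classify(features, energy_threshold):
    """Stage 2: True = speech, False = non-speech.

    Hypothetical rule: a fixed energy threshold, standing in for a trained
    model (GMM, SVM, neural network, ...). The zero-crossing rate is carried
    along as a second feature but is not used by this toy rule.
    """
    return [energy > energy_threshold for energy, _zcr in features]

# Synthetic input (assumed 16 kHz): 0.1 s of silence, then 0.1 s of a 200 Hz tone.
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(1600)]
signal = silence + tone

features = extract_features(signal)
decisions = classify(features, energy_threshold=0.01)
print(decisions)
```

On this synthetic input the four silent frames fall below the threshold and the four tone frames rise above it. Real recordings contain noise, so a fixed threshold of this kind is rarely adequate, which is one reason the more robust features and trained classifiers listed in the passage are used in practice.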