“…Apart from feature-based [1,2,3] and statistical modeling approaches [4,5], recent research effort has been devoted to finding efficient deep-learning-based VAD model architectures. Notable examples include Recurrent Neural Networks (RNN) [6,7,8], Convolutional Neural Networks (CNN) [9,10,11,12], and Convolutional Long Short-Term Memory (LSTM) Deep Neural Networks (CLDNN) [13], which conduct frequency modeling with CNN and temporal modeling with LSTM. LSTM is a popular choice for sequential modeling of VAD tasks [13,6].…”