“…As an input feature, we used a log-mel spectrogram with 80 mel filterbanks, viewing each spectrogram as a single-channel image. As in earlier VAD work [2], [3], [4], [5], we trained each model to detect voice activity in seven neighboring frames [t-19, t-10, t-1, t, t+1, t+10, t+19] for a given frame at time t.…”
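
The seven-frame target described in the excerpt can be illustrated with a minimal sketch. It assumes per-frame binary labels (1 for speech, 0 for non-speech) are already available as a list. The function name `neighbor_targets` is hypothetical, and clamping out-of-range indices to the utterance boundaries is an assumption, since the excerpt does not say how edges are handled.

```python
# Relative frame offsets from the text: [t-19, t-10, t-1, t, t+1, t+10, t+19].
OFFSETS = (-19, -10, -1, 0, 1, 10, 19)


def neighbor_targets(frame_labels, t, offsets=OFFSETS):
    """Return the voice-activity labels at the seven neighboring frames of t.

    Indices that fall outside the utterance are clamped to the first or last
    frame. This edge handling is an assumption, not something the text states.
    """
    last = len(frame_labels) - 1
    if not 0 <= t <= last:
        raise IndexError("t is outside the utterance")
    return [frame_labels[min(max(t + d, 0), last)] for d in offsets]


# Toy utterance: 30 non-speech frames followed by 30 speech frames.
labels = [0] * 30 + [1] * 30
targets = neighbor_targets(labels, 25)
print(targets)
```

For the toy utterance, frame 25 looks up frames 6, 15, 24, 25, 26, 35, and 44, so the target vector mixes non-speech labels from before the onset with speech labels from after it. A model trained this way would emit seven outputs per input frame, one per offset.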