Replay speech answer-sheet detection is an urgent problem for intelligent language learning systems. Traditional features for replay speech detection are usually extracted from the power spectrum. However, the power spectrum may not be the best spectrum from which to extract features for replay speech answer-sheet detection, because it does not take the characteristics of replay speech into account. To address this limitation, this paper proposes a power spectrum decomposition method for replay speech answer-sheet detection in intelligent language learning systems. The log frame-wise normalization spectrum (LFNS) and the log spectral energy (LSE), which take the characteristics of replay speech into account, are obtained by decomposing the log power spectrum computed with the constant-Q transform. Two further features are then derived from LFNS and LSE. The first is the constant-Q normalization octave coefficients (CNOC), obtained by combining LFNS with an octave subband transform. The second is CNOC-LSE, obtained by combining CNOC with LSE. LFNS, CNOC and CNOC-LSE are then fed into frame-based and utterance-based neural networks. Experimental results show that the proposed LFNS outperforms the conventional log power spectrum, and that CNOC and CNOC-LSE perform better than most commonly used features. With the same inputs, the utterance-based neural network outperforms the frame-based neural network. In addition, for the utterance-based neural network the handcrafted features perform worse than the spectrum they are derived from, whereas for the frame-based neural network the opposite holds.
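The abstract does not give the decomposition formula, so the following is only a minimal sketch of one plausible reading: each frame of the power spectrum is divided by its total energy, which splits the log power spectrum into a normalized part (standing in for LFNS) and a per-frame energy part (standing in for LSE). The function name `decompose_log_power_spectrum`, the `eps` floor, and the random array used in place of a real constant-Q power spectrum are all assumptions made for illustration.

```python
import numpy as np

def decompose_log_power_spectrum(power_spec, eps=1e-10):
    """Split a log power spectrum into two additive parts (illustrative only).

    power_spec: non-negative array of shape (n_bins, n_frames),
                e.g. a squared-magnitude constant-Q spectrogram.
    Returns (lfns, lse) such that lfns + lse == log(power_spec + eps).
    """
    power_spec = np.asarray(power_spec, dtype=float) + eps  # avoid log(0)
    # Assumed normalizer: total energy of each frame, shape (1, n_frames).
    frame_energy = power_spec.sum(axis=0, keepdims=True)
    lfns = np.log(power_spec / frame_energy)  # frame-wise normalized part
    lse = np.log(frame_energy)                # per-frame energy part
    return lfns, lse

# Random data stands in for a real constant-Q power spectrum.
rng = np.random.default_rng(0)
power = rng.random((84, 50))
lfns, lse = decompose_log_power_spectrum(power)
```

Under this assumption the two parts sum back to the log power spectrum, so no information is discarded; the normalized part simply no longer depends on the overall level of each frame.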
To detect fake speech answer-sheets generated by text-to-speech or voice conversion algorithms on an intelligent English learning app, this paper proposes a block-based deep neural network (BDNN) for extracting deep features. A BDNN is composed of classification-based blocks, and the deep feature extracted by each block serves as the input to the block that follows it. Unlike the inputs commonly used for deep feature extraction, the magnitude-phase spectrum, which carries both magnitude and phase information, is fed to the BDNNs in order to extract effective deep features. The experimental results show that the proposed method can detect most fake speech answer-sheets on the intelligent English learning app.
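The abstract does not define how the magnitude-phase spectrum is assembled, so the sketch below is an assumption: it takes the FFT of one windowed frame and concatenates the log magnitude with the phase angle. The function name `magnitude_phase_spectrum`, the Hann window, the FFT size of 512, and the random input frame are hypothetical choices, not details from the paper.

```python
import numpy as np

def magnitude_phase_spectrum(frame, n_fft=512, eps=1e-10):
    """Build one input vector carrying magnitude and phase (illustrative only).

    frame: 1-D array of time-domain samples for a single frame.
    Returns a vector of length 2 * (n_fft // 2 + 1).
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))   # taper the frame edges
    spectrum = np.fft.rfft(windowed, n=n_fft)   # one-sided complex spectrum
    log_mag = np.log(np.abs(spectrum) + eps)    # magnitude information
    phase = np.angle(spectrum)                  # phase in [-pi, pi]
    # Assumed layout: log magnitude followed by phase.
    return np.concatenate([log_mag, phase])

# A random frame of 400 samples stands in for real speech.
rng = np.random.default_rng(0)
features = magnitude_phase_spectrum(rng.standard_normal(400))
```

A vector like this would be the input to the first block; how the blocks are trained and chained is not specified in the abstract and is not modelled here.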