Conventional deep neural network (DNN)-based speech enhancement (SE) approaches aim to minimize the mean square error (MSE) between the enhanced speech and a clean reference. An MSE-optimized model, however, may not directly improve the performance of an automatic speech recognition (ASR) system. If the target is to minimize the recognition error, the recognition results should be used to design the objective function for optimizing the SE model. However, an ASR system consists of multiple units, such as acoustic and language models, so its structure is usually complex and not differentiable. In this study, we propose adopting a reinforcement learning algorithm to optimize the SE model based on the recognition results. We evaluated the proposed SE system on the Mandarin Chinese broadcast news corpus (MATBN). Experimental results demonstrate that the proposed method effectively improves the ASR results, with notable error rate reductions of 12.40% and 19.23% at signal-to-noise ratios of 0 dB and 5 dB, respectively.
Index Terms: reinforcement learning, automatic speech recognition, speech enhancement, deep neural network, character error rate
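The recognition error that this abstract proposes to optimize can be made concrete with a small sketch. The character error rate below is the standard edit-distance definition; the function `reward_from_recognition` is purely hypothetical and only illustrates one way a non-differentiable recognition result could be turned into a scalar reward. The abstract does not state the reward actually used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def reward_from_recognition(ref, hyp_before, hyp_after):
    """Hypothetical reward (an assumption, not the paper's definition):
    the drop in character error rate after enhancement."""
    return character_error_rate(ref, hyp_before) - character_error_rate(ref, hyp_after)
```

Because such a reward is computed from the recognizer's output text rather than from a differentiable loss, it fits the reinforcement learning setting described above.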
An indoor acoustic scene monitoring system using a periodic impulse signal was previously developed. Compared with the impulse signal, the chirp signal is more robust against environmental noise because of its specific spectro-temporal structure, which makes the chirp sound easy to detect with a spectro-temporal modulation filtering mechanism. In this paper, we demonstrate a system that applies a two-dimensional spectro-temporal filtering mechanism to a Fourier spectrogram to measure the total energy of the reverberations of the chirp signal, which serves as the representation of the acoustic scene. The system compares the reverberation energy difference between consecutive chirps with a predefined threshold to automatically detect a change in the acoustic scene. Simulations were conducted in real living rooms with various types of background noise. Test results demonstrate that the proposed system is much more effective than previously developed systems at detecting acoustic scene changes caused by an intruder silently walking in the rooms. In the biggest test room (18 × 9.8 × 2.5 m³) with heavy background noise, the proposed system still achieves a correct identification rate higher than 80% for an intruder walking 7 m from the microphone, without producing any false alarms.
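The detection rule described in this abstract, comparing the reverberation energy difference between consecutive chirps with a predefined threshold, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the use of the absolute difference, and the example values are hypothetical, and the spectro-temporal filtering that produces the energies is not shown.

```python
def detect_scene_changes(energies, threshold):
    """Return the indices of chirps whose reverberation energy differs from
    the previous chirp's energy by more than `threshold`.

    `energies` holds one total reverberation energy value per chirp, assumed
    to be computed elsewhere. Using the absolute difference is an assumption.
    """
    changes = []
    for i in range(1, len(energies)):
        if abs(energies[i] - energies[i - 1]) > threshold:
            changes.append(i)
    return changes
```

For example, with the made-up energies `[1.0, 1.05, 1.6, 1.62, 1.0]` and a threshold of `0.3`, the chirps at indices 2 and 4 would be flagged as scene changes.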