Whispering is a special pronunciation style in which the vocal cords do not vibrate. Compared with voiced speech, whispered speech is noise-like because it lacks a fundamental frequency, and its energy is approximately 20 dB lower than that of voiced speech. Converting whispering into normal speech is an effective way to improve speech quality and/or intelligibility. In this paper, we propose a whisper-to-normal speech conversion method based on a sequence-to-sequence framework combined with an auditory attention mechanism. The proposed method does not require time alignment before conversion training, which makes it more applicable to real scenarios. In addition, the fundamental frequency is estimated from the mel-frequency cepstral coefficients produced by the proposed sequence-to-sequence framework. The converted voiced speech has an appropriate length, which is determined adaptively by the sequence-to-sequence model according to the source whispered speech. Experimental results show that the proposed sequence-to-sequence whisper-to-normal speech conversion method outperforms conventional dynamic time warping (DTW)-based methods.
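As a rough, non-authoritative illustration of the kind of model this abstract describes, the sketch below maps whispered-speech MFCC frames to normal-speech MFCC frames with a GRU encoder-decoder and additive attention. The layer sizes, the attention variant, and the externally supplied output length are assumptions for the sketch, not details from the paper (the paper's model determines the output length adaptively from the source whisper).

```python
# Minimal sketch of an attention-based sequence-to-sequence mapping from
# whispered-speech MFCC frames to normal-speech MFCC frames.
# All layer sizes and names are illustrative assumptions, not the authors' exact model.
import torch
import torch.nn as nn


class Seq2SeqConverter(nn.Module):
    def __init__(self, n_mfcc=40, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.decoder_cell = nn.GRUCell(n_mfcc + 2 * hidden, hidden)
        # Additive (Bahdanau-style) attention over the encoder states.
        self.attn_enc = nn.Linear(2 * hidden, hidden)
        self.attn_dec = nn.Linear(hidden, hidden)
        self.attn_v = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, n_mfcc)

    def forward(self, whisper_mfcc, n_out_frames):
        # whisper_mfcc: (batch, T_in, n_mfcc). The output length is supplied by the
        # caller here, whereas the paper's model chooses it adaptively.
        enc_states, _ = self.encoder(whisper_mfcc)          # (B, T_in, 2H)
        B = whisper_mfcc.size(0)
        h = enc_states.new_zeros(B, self.decoder_cell.hidden_size)
        prev = whisper_mfcc.new_zeros(B, self.out.out_features)
        keys = self.attn_enc(enc_states)                    # (B, T_in, H)
        outputs = []
        for _ in range(n_out_frames):
            score = self.attn_v(torch.tanh(keys + self.attn_dec(h).unsqueeze(1)))
            weights = torch.softmax(score, dim=1)           # (B, T_in, 1)
            context = (weights * enc_states).sum(dim=1)     # (B, 2H)
            h = self.decoder_cell(torch.cat([prev, context], dim=-1), h)
            prev = self.out(h)                              # next normal-speech MFCC frame
            outputs.append(prev)
        return torch.stack(outputs, dim=1)                  # (B, T_out, n_mfcc)


# Usage: map two 200-frame whispered utterances to 180 normal-speech frames each.
model = Seq2SeqConverter()
converted = model(torch.randn(2, 200, 40), n_out_frames=180)
print(converted.shape)  # torch.Size([2, 180, 40])
```

Because the decoder attends over the whole encoded whisper at every output step, no frame-level time alignment between source and target utterances is needed during training, which is the property the abstract emphasizes.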
Converting whispered speech to normal voiced speech has been a hot research topic in the speech signal processing area, and a complete, large-scale whisper database is a major prerequisite for this task. In this paper, we present a multimodal whisper database in Mandarin Chinese. A total of 103 syllables and 100 sentences were carefully selected. Five male and five female participants pronounced the syllables and sentences in both whispered and normal styles, resulting in 4,096 parallel speech utterances and 263,849 frames of facial and lip image sequences. The beginning and ending sample points of each syllable were labeled for both the speech signal and the face video. The lip regions of interest were also extracted and provided in the proposed database. Experiments on various speech conversion tasks with different speech databases show the effectiveness of the proposed multimodal whisper database.
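To make the described contents concrete, the sketch below shows one hypothetical record layout for a parallel item in such a multimodal corpus. The field names, file paths, and values are assumptions for illustration only, not the database's actual format.

```python
# Hypothetical record layout for one parallel item in a multimodal whisper corpus
# of the kind described above; every field name and path is an assumption.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParallelUtterance:
    speaker_id: str                            # e.g. "M01" .. "F05"
    text: str                                  # syllable or sentence prompt
    whisper_wav: str                           # path to the whispered recording
    normal_wav: str                            # path to the normal (voiced) recording
    syllable_bounds: List[Tuple[int, int]]     # (start, end) sample points per syllable
    lip_roi_frames: List[str]                  # extracted lip region-of-interest images


utt = ParallelUtterance(
    speaker_id="M01",
    text="ni hao",
    whisper_wav="whisper/M01/sent_001.wav",
    normal_wav="normal/M01/sent_001.wav",
    syllable_bounds=[(1200, 9800), (10100, 18400)],
    lip_roi_frames=["lips/M01/sent_001/0001.png", "lips/M01/sent_001/0002.png"],
)
```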
This paper presents a voice conversion (VC) technique for noisy environments. Typically, VC methods use only audio information and assume a noiseless environment; however, existing conversion methods do not always achieve satisfactory results in adverse acoustic conditions. To solve this problem, we propose a multimodal voice conversion model based on a deep convolutional neural network (MDCNN), built by combining two convolutional neural networks (CNNs) and a deep neural network (DNN), for VC in noisy environments. In the MDCNN, both acoustic and visual information are incorporated into the voice conversion to improve its robustness in adverse acoustic conditions. The two CNNs are designed to extract acoustic and visual features, and the DNN is designed to capture the nonlinear mapping between source and target speech. Experimental results indicate that the proposed MDCNN outperforms two existing approaches in noisy environments.
Index Terms: audio and video feature fusion, convolutional neural network, deep learning, mel-frequency cepstral coefficients, multilayer feedforward neural networks, multimodal voice conversion, noise robustness.
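The sketch below illustrates the general audio-visual fusion idea the abstract outlines: one CNN branch for acoustic features, one for lip images, and a fully connected DNN mapping the fused features to the target speaker's spectral features. All layer sizes, input shapes, and names are assumptions for the sketch rather than the paper's exact MDCNN architecture.

```python
# Minimal sketch of a multimodal (audio + lip image) conversion network in the
# spirit of the MDCNN described above; sizes and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class MDCNNSketch(nn.Module):
    def __init__(self, n_out=40):
        super().__init__()
        # Acoustic branch: a window of source spectral frames treated as a 1-channel image.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
        )
        # Visual branch: a grayscale lip region-of-interest image.
        self.video_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
        )
        # DNN mapping the fused audio-visual features to target spectral features.
        self.dnn = nn.Sequential(
            nn.Linear(2 * 32 * 4 * 4, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_out),
        )

    def forward(self, audio_patch, lip_image):
        fused = torch.cat([self.audio_cnn(audio_patch), self.video_cnn(lip_image)], dim=-1)
        return self.dnn(fused)


# Usage: an 11-frame x 40-coefficient acoustic patch and a 64x64 lip image per example.
model = MDCNNSketch()
target = model(torch.randn(8, 1, 11, 40), torch.randn(8, 1, 64, 64))
print(target.shape)  # torch.Size([8, 40])
```

The visual branch contributes information that is unaffected by acoustic noise, which is why fusing it with the acoustic branch can improve robustness in adverse conditions.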