Multisensory processing for speech enhancement and magnitude-normalized spectra for speech modeling

Subramanya, Amarnag; Zhang, Zhengyou; Liu, Zicheng; Acero, Alex

doi:10.1016/j.specom.2007.09.002

Cited by 22 publications

(10 citation statements)

References 22 publications

“…In order to use the corresponding target speaker as the input speaker, i.e., optimization of reconstructed target spectra and/or performing target-to-source conversion, the notations of x and y, in Eqs. (1)-(7), are swapped with each other. Though, the performance of VAE-based VC is noticeably insufficient because the conversion flow is not considered in the parameter optimization.…”

Section: Conventional VAE-based VC (mentioning)

confidence: 99%

Non-Parallel Voice Conversion with Cyclic Variational Autoencoder

Tobing

Hayashi

et al. 2019

Interspeech 2019

In this paper, we present a novel technique for non-parallel voice conversion (VC) that uses cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian prior, is used to encode a set of input features. In VAE-based VC, the encoded latent features are fed into a decoder, along with speaker-coding features, to generate estimated spectra with either the original speaker identity (reconstructed) or another speaker identity (converted). Due to the non-parallel modeling condition, the converted spectra cannot be directly optimized, which heavily degrades the performance of VAE-based VC. In this work, to overcome this problem, we propose a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized. The cyclic flow can be continued by using the cyclic reconstructed features as input for the next cycle. The experimental results demonstrate the effectiveness of the proposed CycleVAE-based VC, which yields higher accuracy of converted spectra, generates latent features with a higher degree of correlation, and significantly improves the quality and conversion accuracy of the converted speech.

“…The results have been evaluated for GH, LP, optimally modified log spectral amplitude (OM-LSA) [2] and an existing probabilistic approach (PA) [14]. Table 3 presents the LSD results for Gaussian noise with different SNR levels and interfering speech, obtained by using four different speech enhancement methods: GH, LP, OM-LSA and PA.…”

Section: Results (mentioning)

confidence: 99%

Multisensory speech enhancement in noisy environments using bone-conducted and air-conducted microphones

Cohen

Mousazadeh

2014

2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP)

In this paper, we propose a speech enhancement algorithm for estimating the clean speech using samples of air-conducted and bone-conducted speech signals. We introduce a model in a supervised learning framework by approximating a mapping from the concatenation of noisy air-conducted and bone-conducted speech to clean speech in the short-time Fourier transform domain. Two function extension schemes are utilized: geometric harmonics and Laplacian pyramid. The performance of the two schemes is evaluated and compared in terms of spectrograms and log spectral distance measures.
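
The log spectral distance (LSD) mentioned in this abstract and in the citation statement above can be illustrated with a minimal sketch. It assumes one common definition (per-frame root-mean-square difference of log power spectra in dB, averaged over frames); the cited paper may use a different variant, and the function name, argument layout, and `floor` parameter are hypothetical.

```python
import math

def log_spectral_distance(ref_power, est_power, floor=1e-10):
    """Mean per-frame LSD in dB between two power spectrograms.

    ref_power, est_power: lists of frames, each frame a list of
    non-negative power values (|STFT|^2) with matching shapes.
    floor: assumed small constant that avoids log10(0).
    """
    if len(ref_power) != len(est_power) or not ref_power:
        raise ValueError("spectrograms must have the same, non-zero frame count")
    total = 0.0
    for ref_frame, est_frame in zip(ref_power, est_power):
        if len(ref_frame) != len(est_frame) or not ref_frame:
            raise ValueError("frames must have the same, non-zero bin count")
        # Squared difference of the log power spectra, bin by bin.
        squared = [
            (10.0 * math.log10(max(r, floor)) - 10.0 * math.log10(max(e, floor))) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        # Root mean square over frequency bins for this frame.
        total += math.sqrt(sum(squared) / len(squared))
    # Average over frames.
    return total / len(ref_power)
```

Under this definition, identical spectrograms give a distance of zero, and an estimate whose power is uniformly ten times the reference gives 10 dB.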

“…Subramanya et al. proposed a statistical feature mapping technique for achieving good noise suppression. Liu et al.…”

Section: Introduction (mentioning)

confidence: 99%

Multisensory speech enhancement using lower‐frequency components from bone‐conducted speech

Rahman

Saha

Shimamura

2019

IEEJ Transactions on Electrical and Electronic Engineering

In this article, we present a multisensory speech enhancement technique that suppresses low-frequency band noise in the speech signal. Speech is often corrupted by colored noise, such as car noise and multi-talker babble, that mostly affects the low-frequency region of the speech signal. We propose a multisensory approach for noise reduction that utilizes bone-conducted (BC) speech. Since BC speech is caused by vibrations that travel through the vocal tract wall and skull bone, it is robust against ambient conditions. Unfortunately, BC speech suffers from reduced intelligibility because it lacks higher-frequency components. However, the low-frequency region of BC speech is completely intelligible. When normal air-conducted (AC) speech is corrupted by low-frequency band noise, the noise can be readily suppressed by replacing the low-frequency region of the noisy AC speech with that of the BC speech. Because the proposed method does not require any noise estimation, it avoids the distortion that imperfect noise estimation introduces in traditional enhancement techniques. Evaluation results show that the method can be effectively used for suppressing low-frequency band noise even under very low signal-to-noise ratio (SNR) conditions.
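
The band-replacement idea in this abstract can be sketched for a single frame as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses a naive DFT on one frame, a hard cutoff, and no gain matching between the two sensors. The names `replace_low_band`, `sample_rate`, and `cutoff_hz` are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Naive inverse DFT matching dft() above."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

def replace_low_band(ac_frame, bc_frame, sample_rate, cutoff_hz):
    """Return the AC frame with its bins below cutoff_hz taken from the BC frame.

    Assumption: both frames are time-aligned, equally long, and comparably
    scaled; a real system would need alignment and level equalization.
    """
    n = len(ac_frame)
    if len(bc_frame) != n:
        raise ValueError("AC and BC frames must have the same length")
    ac_spec = dft(ac_frame)
    bc_spec = dft(bc_frame)
    mixed = list(ac_spec)
    for k in range(n):
        # Bins above n/2 mirror negative frequencies, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        if freq < cutoff_hz:
            mixed[k] = bc_spec[k]
    # Mirrored bins were replaced together, so the result is real up to rounding.
    return [value.real for value in idft(mixed)]
```

In this sketch, a low-frequency noise tone added to the AC frame is removed whenever it falls below the cutoff, because every bin in that band is taken from the BC frame, while AC content above the cutoff passes through unchanged.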
