ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9052966
MSpeC-Net: Multi-Domain Speech Conversion Network

Cited by 9 publications (10 citation statements) · References 16 publications
“…While all papers supported their results with at least one objective metric, only a few provided a subjective evaluation, namely: Lian et al (2019a), Parmar et al (2019), Patel et al (2021), Malaviya et al (2020), and Patel et al (2019). The lack of a subjective evaluation is justified in Niranjan et al (2020) since the VC was implemented in the context of ASR and having machine intelligibility in mind.…”
Section: Subjective Metrics
confidence: 99%
“…In Malaviya et al (2020), a multi-domain speech conversion system is proposed, capable of converting both from Non-Audible Murmur (NAM) and from whispered speech to normal speech, through three domain-specific AutoEncoders (AEs). These AEs are used to obtain an internal representation of features, which are known as latent representations.…”
Section: MSpeC-Net
confidence: 99%
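The citation above describes the core idea of MSpeC-Net: one autoencoder per speech domain, with the encoders producing latent representations that can be bridged to the normal-speech decoder. The following is a minimal sketch of that idea, not the authors' implementation; the feature dimension, layer sizes, and the latent-alignment loss are assumptions for illustration only.

```python
# Sketch (not the published MSpeC-Net code): three domain-specific
# autoencoders whose encoders map NAM, whispered, and normal speech
# features into latent representations.
import torch
import torch.nn as nn

FEAT_DIM = 40     # assumed spectral feature dimension
LATENT_DIM = 64   # assumed latent representation size


class DomainAE(nn.Module):
    """One domain-specific autoencoder (NAM, whisper, or normal speech)."""

    def __init__(self, feat_dim=FEAT_DIM, latent_dim=LATENT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent representation of the input frames
        return self.decoder(z), z


# One autoencoder per domain, as described in the citation above.
aes = {name: DomainAE() for name in ("nam", "whisper", "normal")}

# Toy batch of frame-level features per domain (random placeholders).
batch = {name: torch.randn(8, FEAT_DIM) for name in aes}

mse = nn.MSELoss()
recon_loss, latents = 0.0, {}
for name, ae in aes.items():
    recon, z = ae(batch[name])
    recon_loss = recon_loss + mse(recon, batch[name])
    latents[name] = z

# Hypothetical latent-alignment term: pull NAM/whisper latents toward the
# normal-speech latents so a source encoder can be chained with the
# normal-speech decoder at conversion time.
align_loss = mse(latents["nam"], latents["normal"]) + \
             mse(latents["whisper"], latents["normal"])

total_loss = recon_loss + align_loss
print(f"reconstruction: {recon_loss.item():.3f}, alignment: {align_loss.item():.3f}")
```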
“…Such as LSTM, MSpeC-Net, DiscoGAN, CycleGAN, etc. are proposed in the literature [2], [9], [12], [19]- [24]. Moreover, CycleGAN has shown state-of-the-art result for WHSP2SPCH conversion including F0 prediction on parallel data, which relies on the availability of particular speaker's whisper, and normal speech [25].…”
Section: Introduction
confidence: 99%
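The citation above points to CycleGAN-based whisper-to-speech (WHSP2SPCH) conversion. The snippet below is a minimal sketch of the cycle-consistency idea that underlies such systems, using placeholder linear generators rather than any published architecture; the feature dimension and generator names are assumptions for illustration.

```python
# Sketch of the cycle-consistency loss used in CycleGAN-style
# whisper-to-speech conversion: mapping whisper -> speech -> whisper
# (and vice versa) should reconstruct the original features.
import torch
import torch.nn as nn

FEAT_DIM = 40  # assumed feature dimension

G_w2s = nn.Linear(FEAT_DIM, FEAT_DIM)   # whisper -> speech generator (placeholder)
G_s2w = nn.Linear(FEAT_DIM, FEAT_DIM)   # speech -> whisper generator (placeholder)
l1 = nn.L1Loss()

whisper = torch.randn(8, FEAT_DIM)      # toy whispered-speech features
speech = torch.randn(8, FEAT_DIM)       # toy normal-speech features

# Cycle-consistency: x -> G(x) -> F(G(x)) should return to x in both directions.
cycle_loss = l1(G_s2w(G_w2s(whisper)), whisper) + \
             l1(G_w2s(G_s2w(speech)), speech)
print(f"cycle-consistency loss: {cycle_loss.item():.3f}")
```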