Evaluation of Tacotron Based Synthesizers for Spanish and Basque

García, Víctor Manuel Giménez; Hernáez, Inma; Navas, Eva

doi:10.3390/app12031686

Cited by 4 publications

(4 citation statements)

References 18 publications

“…Their work concluded that it is sufficient to obtain a speaker's identity (a target speaker's voice attributes) with only one sample of data (i.e., one single sentence from the target speaker). For Spanish and Basque, the performance of the Tacotron2-based system was examined with limited amounts of data [23]. Guided attention was implemented, which provided the system with the explicit duration of the phonemes to reduce lost alignment during the inference process.…”

Section: Limited Data Speaker Adaptation

mentioning

confidence: 99%

Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2

Mandeel, Al-Radhi, Csapó

2022

Infocommunications journal

Speech synthesis has the aim of generating humanlike speech from text. Nowadays, with end-to-end systems, highly natural synthesized speech can be achieved if a large enough dataset is available from the target speaker. However, often it would be necessary to adapt to a target speaker for whom only a few training samples are available. Limited data speaker adaptation might be a difficult problem due to the overly few training samples. Issues might appear with a limited speaker dataset, such as the irregular allocation of linguistic tokens (i.e., some speech sounds are left out from the synthesized speech). To build lightweight systems, measuring the number of minimum data samples and training epochs is crucial to acquire a reasonable quality. We conducted detailed experiments with four target speakers for adaptive speaker text-to-speech (TTS) synthesis to show the performance of the end-to-end Tacotron2 model and the WaveGlow neural vocoder with an English dataset at several training data samples and training lengths. According to our investigation of objective and subjective evaluations, the Tacotron2 model exhibits good performance in terms of speech quality and similarity for unseen target speakers at 100 sentences of data (pair of text and audio) with a relatively low training time.

“…The popular text to spectrogram models include Tacotron2, Transformer-TTS (Li et al, 2019), FastSpeech2 (Ren et al, 2020), Fast-Pitch (Łańcucki, 2021), and Glow-TTS. In terms of voice quality the Tacotron2 model is still competitive with other models and less prone to over-fitting in low resource settings (Favaro et al, 2021; Abdelali et al, 2022; García et al, 2022; Finkelstein et al, 2022). There are multiple options for the vocoder as well like Clarinet (Ping et al, 2018), Waveglow (Prenger et al, 2019), MelGAN (Kumar et al, 2019), HiFiGAN, StyleMelGAN (Mustafa et al, 2021), and ParallelWaveGAN (Yamamoto et al, 2020).…”

Section: Introduction

mentioning

confidence: 99%

“…There are multiple options for the vocoder as well like Clarinet (Ping et al, 2018), Waveglow (Prenger et al, 2019), MelGAN (Kumar et al, 2019), HiFiGAN, StyleMelGAN (Mustafa et al, 2021), and ParallelWaveGAN (Yamamoto et al, 2020). We choose Waveglow since it is competitive with other vocoders and is easy to train (Abdelali et al, 2022; García et al, 2022; Shih et al, 2021).…”

Section: Introduction

mentioning

confidence: 99%

Code-Mixed Text-to-Speech Synthesis Under Low-Resource Constraints

Joshi, Garera

2023

Lecture Notes in Computer Science

Text-to-speech (TTS) systems are being built using end-to-end deep learning approaches. However, these systems require huge amounts of training data. We present our approach to build production-quality TTS and perform speaker adaptation in extremely low-resource settings. We propose a transfer learning approach using high-resource language data and synthetically generated data. We transfer the learnings from the out-of-domain, high-resource English language. Further, we make use of an out-of-the-box single-speaker TTS in the target language to generate in-domain synthetic data. We employ a three-step approach to train a high-quality single-speaker TTS system in a low-resource Indian language, Hindi. We use a Tacotron2-like setup with a spectrogram prediction network and a WaveGlow vocoder. The Tacotron2 acoustic model is trained on English data, followed by synthetic Hindi data from the existing TTS system. Finally, the decoder of this model is fine-tuned on only 3 hours of target Hindi speaker data to enable rapid speaker adaptation. We show the importance of this dual pre-training and decoder-only fine-tuning using subjective MOS evaluation. Using transfer learning from a high-resource language and a synthetic corpus, we present a low-cost solution to train a custom TTS model.

“…However, with the advancement of machine learning and deep learning models, it has become very easy to manipulate the signals and generate spoofed speech to deceive the listener [1]. Moreover, various speech synthesis algorithms, such as GAN [2], Deepvoice [3], tacotron2 [4], and wavenet [5], have gained importance to generate natural speech just like humans and defeat the automatic speaker verification (ASV) systems. For example, false information related to politics based on deep fakes became a significant threat to the US presidential election in 2020 [6].…”

mentioning

confidence: 99%

EDL-Det: A Robust TTS Synthesis Detector Using VGG19-Based YAMNet and Ensemble Learning Block

Mahum, Irtaza, Javed

2023

IEEE Access

Various algorithms exist for audio deepfake synthesis, such as deep voice, tacotron, fastspeech, and imitation techniques. Despite the existence of various spoofing speech detectors, they are not ready to distinguish unseen audio samples with high precision. In this study, we suggest a robust model, namely Ensemble Deep Learning based Detector (EDL-Det), to detect text-to-speech (TTS) audio and categorize it into spoofed and bonafide classes. Our proposed model is an improved method based on YAMNet, employing VGG19 as a base network instead of MobileNet, combined with two other deep learning (DL) methods. Our proposed system effectively analyzes the mel-spectrograms generated from input audio to extract the better artifacts underlying the audio signals. We have added an ensemble learning block that consists of ResNet50 and InceptionNetv2. First, we convert speech into mel-spectrograms that consist of time-frequency representations. Second, we train our model using the ASVspoof-2019 dataset. In the end, we classify the audio samples, after converting them into mel-spectrograms, using our trained binary classifier along with a majority voting scheme by three networks. Due to its deep convolutional network architecture, our proposed model effectively extracts the most representative features from the mel-spectrograms. Furthermore, we have performed extensive experiments to assess the performance of the suggested model using the ASVspoof 2019 corpus. Additionally, our proposed model is robust enough to identify unseen spoofed audio samples and accurately classify the attacks based on cloning algorithms.
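
The majority voting scheme mentioned in the abstract above can be illustrated with a minimal sketch. This is not the EDL-Det implementation: the function name `majority_vote` and the per-network labels are hypothetical, and each of the three networks is assumed to emit one hard label ("spoofed" or "bonafide") per clip.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label that most classifiers chose for one audio clip."""
    # most_common(1) yields [(label, count)] for the most frequent label.
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical hard decisions from three networks for three clips.
votes_per_clip = [
    ["spoofed", "spoofed", "bonafide"],
    ["bonafide", "bonafide", "bonafide"],
    ["bonafide", "spoofed", "spoofed"],
]

decisions = [majority_vote(votes) for votes in votes_per_clip]
print(decisions)
```

With three voters and two classes a tie cannot occur, so no tie-breaking rule is needed in this setting.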
