2022
DOI: 10.3390/app12031686

Evaluation of Tacotron Based Synthesizers for Spanish and Basque

Abstract: In this paper, we describe the implementation and evaluation of text-to-speech synthesizers based on neural networks for Spanish and Basque. Several voices were built, all of them using a limited amount of data. The system applies Tacotron 2 to compute mel-spectrograms from the input sequence, followed by WaveGlow as a neural vocoder to obtain the audio signals from the spectrograms. The limited amount of data used for training the models leads to synthesis errors in some sentences. To automatically detect those…

Cited by 4 publications (4 citation statements)
References 18 publications
“…Their work concluded that it is sufficient to obtain a speaker's identity (a target speaker's voice attributes) with only one sample of data (i.e., one single sentence from the target speaker). For Spanish and Basque, the performance of the Tacotron2-based system was examined with limited amounts of data [23]. Guided attention was implemented, which provided the system with the explicit duration of the phonemes to reduce lost alignment during the inference process.…”
Section: Limited Data Speaker Adaptation
mentioning
confidence: 99%
“…The popular text to spectrogram models include Tacotron2, Transformer-TTS (Li et al, 2019), FastSpeech2 (Ren et al, 2020), FastPitch (Łańcucki, 2021), and Glow-TTS. In terms of voice quality the Tacotron2 model is still competitive with other models and less prone to over-fitting in low resource settings (Favaro et al, 2021; Abdelali et al, 2022; García et al, 2022; Finkelstein et al, 2022). There are multiple options for the vocoder as well like Clarinet (Ping et al, 2018), Waveglow (Prenger et al, 2019), MelGAN (Kumar et al, 2019), HiFiGAN, StyleMelGAN (Mustafa et al, 2021), and ParallelWaveGAN (Yamamoto et al, 2020).…”
Section: Introduction
mentioning
confidence: 99%
“…There are multiple options for the vocoder as well like Clarinet (Ping et al, 2018), Waveglow (Prenger et al, 2019), MelGAN (Kumar et al, 2019), HiFiGAN, StyleMelGAN (Mustafa et al, 2021), and ParallelWaveGAN (Yamamoto et al, 2020). We choose Waveglow since it is competitive with other vocoders and is easy to train (Abdelali et al, 2022; García et al, 2022; Shih et al, 2021).…”
Section: Introduction
mentioning
confidence: 99%
“…However, with the advancement of machine learning and deep learning models, it has become very easy to manipulate the signals and generate spoofed speech to deceive the listener [1]. Moreover, various speech synthesis algorithms, such as GAN [2], Deepvoice [3], tacotron2 [4], and wavenet [5], have gained importance to generate natural speech just like humans and defeat the automatic speaker verification (ASV) systems. For example, false information related to politics based on deep fakes became a significant threat to the US presidential election in 2020 [6].…”
mentioning
confidence: 99%