Rapid Development of TTS Corpora for Four South African Languages

Niekerk, Daniel van; Heerden, Charl Johannes van; Davel, Marelie H.; Kleynhans, Neil; Kjartansson, Oddur; Jansche, Martin; Ha, Linne

doi:10.21437/interspeech.2017-1139

Cited by 18 publications

(10 citation statements)

References 8 publications

“…We used speech data from 300 speakers [23,24,25,26,27,28,29,30,31] consisting of over 18 languages/dialects including Table 3:…”

Section: Experimental Conditions (mentioning)

confidence: 99%

High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling

Tobing, Toda

2021

Preprint

This paper presents a novel high-fidelity and low-latency universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP). MWDLP employs a coarse-fine bit WaveRNN architecture for 10-bit mu-law waveform modeling. A sparse gated recurrent unit with a relatively large size of hidden units is utilized, while the multiband modeling is deployed to achieve real-time low-latency usage. A novel technique for data-driven linear prediction (LP) with discrete waveform modeling is proposed, where the LP coefficients are estimated in a data-driven manner. Moreover, a novel loss function using short-time Fourier transform (STFT) for discrete waveform modeling with Gumbel approximation is also proposed. The experimental results demonstrate that the proposed MWDLP framework generates high-fidelity synthetic speech for seen and unseen speakers and/or language on 300 speakers training data including clean and noisy/reverberant conditions, where the number of training utterances is limited to 60 per speaker, while allowing for real-time low-latency processing using a single core of ∼2.1-2.7 GHz CPU with ∼0.57-0.64 real-time factor including input/output and feature extraction.

“…The above experiment was partially repeated on two additional corpora to determine whether the trends can be expected to hold more generally. Subsets of the single-speaker Lwazi 3 corpus [23] and a recently developed multi-speaker TTS corpus were used [7] (these subsets exclude foreign language parts - see Table I). The results using RCRL, NEW0 and NEW1 display a similar trend, with the possible exception of the temporal measure - see Table III.…”

Section: B Evaluation On Additional Corpora (mentioning)

confidence: 99%

“…For this experiment, two TTS voices were built as before using the Lwazi 2 corpus and RCRL and NEW1 dictionaries. The Lwazi 2 corpus was selected since the recording of small corpora is typical in TTS development for underresourced languages (see for example [6] and [7]). Unseen sentences (35 in total) were randomly selected from three sources: a few news articles 6 (16 sentences), the Universal Declaration of Human Rights 7 (9 sentences), and Wikipedia 8 (10 sentences).…”

Section: Subjective Evaluation (mentioning)

confidence: 99%

“…Furthermore, sparsity is a serious consideration during system design and development with limited training data (e.g. in TTS corpus development [6], [7]).…”

Section: Introduction (mentioning)

confidence: 99%

“…Articles were manually accessed during September 2016 from: http://www.netwerk24.com/ 7 From: http://www.unicode.org/udhr/d/udhr afr.html 8 Sentences were taken from the Afrikaans corpus in [7].…”

mentioning

confidence: 99%

Evaluating acoustic modelling of lexical stress for Afrikaans speech synthesis

Niekerk

2017

2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech)

An explicit lexical stress feature is investigated for statistical parametric speech synthesis in Afrikaans: Firstly, objective measures are used to assess proposed annotation protocols and dictionaries compared to the baseline (implicit modelling) on the Lwazi 2 text-to-speech corpus. Secondly, the best candidates are evaluated on additional corpora. Finally, a comparative subjective evaluation is conducted to determine the perceptual impact on text-to-speech synthesis. The best candidate dictionary is associated with favourable objective results obtained on all corpora and was preferred in the subjective test. This suggests that it may form a basis for further refinement and work on improved prosodic models.
