Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.507

Requirements and Motivations of Low-Resource Speech Synthesis for Language Revitalization

Abstract: This paper describes the motivation and development of speech synthesis systems for the purposes of language revitalization. By building speech synthesis systems for three Indigenous languages spoken in Canada, Kanien'kéha, Gitksan & SENĆOŦEN, we re-evaluate the question of how much data is required to build low-resource speech synthesis systems featuring state-of-the-art neural models. For example, preliminary results with English data show that a FastSpeech2 model trained with 1 hour of training data can pro…

Cited by 8 publications (4 citation statements); references 12 publications.

Citation statements (ordered by relevance):
“…We train 7 different FastPitch [12] acoustic models using either 2 or 8 hours of speech to predict mel-scaled spectrogram features from character, phone or discrete acoustic unit sequences, as described in the following sections. Following [21,1], we replace all convolutional layers with depthwise-separable convolutions, reducing overall parameter counts to match our low-data setting. Character- and phone-input models are trained with target durations from forced alignments, while for acoustic unit sequences we derive target durations by run-length encoding repeated consecutive units.…”
Section: Model Specification
Confidence: 99%
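The passage above names two concrete techniques: swapping standard convolutions for depthwise-separable ones to shrink the model for a low-data setting, and run-length encoding repeated acoustic units to obtain per-unit target durations. The sketch below illustrates both ideas in PyTorch; it is not code from the cited paper, and the channel count, kernel size, and function names are illustrative assumptions.

```python
# Minimal sketch of the two ideas in the quoted passage; NOT the authors' code.
# Channel count and kernel size are illustrative guesses.
import torch
import torch.nn as nn


class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise mixing conv."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time), as in a FastPitch-style encoder block.
        return self.pointwise(self.depthwise(x))


def run_length_encode(units: list[int]) -> tuple[list[int], list[int]]:
    """Collapse repeated consecutive acoustic units; the run lengths become target durations."""
    ids: list[int] = []
    durations: list[int] = []
    for u in units:
        if ids and ids[-1] == u:
            durations[-1] += 1
        else:
            ids.append(u)
            durations.append(1)
    return ids, durations


if __name__ == "__main__":
    def n_params(m: nn.Module) -> int:
        return sum(p.numel() for p in m.parameters())

    standard = nn.Conv1d(256, 256, kernel_size=9, padding=4)
    separable = DepthwiseSeparableConv1d(256, kernel_size=9)
    print(n_params(standard), n_params(separable))   # roughly 590k vs. 68k parameters

    # A frame-level unit sequence becomes (unique units, durations) = ([7, 3, 9], [3, 2, 1]).
    print(run_length_encode([7, 7, 7, 3, 3, 9]))
```

The parameter comparison at the end is what makes depthwise-separable layers attractive when only a few hours of speech are available: the depthwise and pointwise pair grows roughly with channels × kernel plus channels², rather than channels² × kernel.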
“…Data requirements for neural TTS have typically been put at some tens of hours of studio-quality speech recordings paired with text transcripts, although recent work has reevaluated these assumptions by switching to non-autoregressive architectures where the burden of learning text-speech alignments alongside acoustic feature prediction is removed [1], or by using powerful self-supervised speech representations to help train some parts of the system on noisier audio data [2]. Much effort has also been directed toward using 'found' data not originally intended for TTS, especially audiobook recordings.…”
Section: Introduction
Confidence: 99%
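The point about non-autoregressive architectures removing the burden of learning text-speech alignment can be made concrete with the "length regulator" used in FastSpeech2-style systems: durations come from an external source such as a forced aligner, and the acoustic model simply repeats each symbol's encoding. The sketch below is an illustrative assumption, not code from the paper or its citing works.

```python
# Illustrative FastSpeech2-style length regulator; shapes and names are assumptions.
import torch


def length_regulate(hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand per-symbol encoder states (symbols, dim) to frame level (frames, dim)."""
    # Row i is repeated durations[i] times, so alignment is supplied rather than learned.
    return torch.repeat_interleave(hidden, durations, dim=0)


if __name__ == "__main__":
    hidden = torch.randn(3, 4)              # 3 phones, 4-dim encoder states
    durations = torch.tensor([2, 5, 3])     # frames per phone, e.g. from a forced aligner
    print(length_regulate(hidden, durations).shape)   # torch.Size([10, 4])
```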
“…Although such finetuning is obviously still advantageous, another more likely scenario for LRLs is that researchers could find a few enthusiastic listeners that would not mind rating many samples, rather than the other way around. Besides, the work by [26] shows that, for LRLs with very small communities, the evaluation of a few community-engaged and respected speakers of the language can be representative of that of the whole community.…”
Section: Model Training and Prediction
Confidence: 99%