Interspeech 2019
DOI: 10.21437/interspeech.2019-1705

High Quality, Lightweight and Adaptable TTS Using LPCNet

Abstract: We present a lightweight adaptable neural TTS system with high quality output. The system is composed of three separate neural network blocks: prosody prediction, acoustic feature prediction and Linear Prediction Coding Net as a neural vocoder. This system can synthesize speech with close to natural quality while running 3 times faster than real-time on a standard CPU. The modular setup of the system allows for simple adaptation to new voices with a small amount of data. We first demonstrate the ability of the s…

Cited by 38 publications (26 citation statements)
References 12 publications
“…In the [TEXT] setting, unlike the other two settings described above, we assume that transcripts and semantic labels are available but corresponding audio is absent. To circumvent the lack of real speech, as in [8,9], we use a TTS system [31] to synthesize speech from the transcripts and then use the speech to perform SLU training. Given that synthesized speech can be quite different from the test audio, we 4).…”
Section: SLU Models in the [TEXT] Setting (mentioning)
confidence: 99%
“…The TTS system architecture is similar to the single speaker system described in [25]. It is a modular system based on three neural-net models: one to infer prosody, one to infer acoustic features, and an LPCNet [26] vocoder.…”
Section: TTS System (mentioning)
confidence: 99%
“…Previous studies on few-shot TTS could be categorized into two general approaches. The first approach pre-trains multi-speaker TTS models on a large multi-speaker dataset and then fine-tunes the models on a small dataset of target speaker [7,8]. The second approach predicts a speaker embedding from speech to clone unseen speakers without fine-tuning.…”
Section: Introduction (mentioning)
confidence: 99%