ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053704

Improving LPCNET-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network

Abstract: In this paper, we propose an improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN). The recently proposed LPCNet vocoder has successfully achieved high-quality and lightweight speech synthesis systems by combining a vocal tract LP filter with a WaveRNN-based vocal source (i.e., excitation) generator. However, the quality of synthesized speech is often unstable because the vocal source component is insufficiently represented by the µ-law quantization method, and the mod…
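
For readers unfamiliar with the µ-law quantization the abstract refers to, the following is a minimal sketch of standard 8-bit µ-law companding. It is not taken from the paper: the function names `mulaw_encode` and `mulaw_decode` and the choice µ = 255 are assumptions made for illustration. It only shows why the quantization steps are coarse for large-amplitude samples.

```python
import math

MU = 255  # assumed 8-bit mu-law setting; not stated in the abstract


def mulaw_encode(x: float, mu: int = MU) -> int:
    """Compress a sample in [-1, 1] and quantize it to an integer code in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip to the valid input range
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1.0) / 2.0 * mu))


def mulaw_decode(code: int, mu: int = MU) -> float:
    """Map an integer code back to an approximate sample in [-1, 1]."""
    y = 2.0 * code / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


if __name__ == "__main__":
    for sample in (0.01, 0.25, 0.9):
        restored = mulaw_decode(mulaw_encode(sample))
        print(sample, restored, abs(sample - restored))
```

The round-trip error grows with amplitude because the 256 levels are spaced logarithmically, which is one way to picture the limited precision the abstract mentions.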

Cited by 8 publications (7 citation statements)
References 12 publications

“…In typical lightweight neural vocoder where small model is adopted, it is necessary to adjust the sharpness of the output distributions to avoid noise caused by the random sampling process and achieve better quality. In FFTNet [18] and iLPCNet [19], lowering temperature in the voiced region with a constant factor is exploited for such purpose. Rather than using voiced information, LPCNet adopts pitch correlation to adjust the temperature factor.…”
Section: Generation Methods
confidence: 99%
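
The temperature adjustment described in the statement above can be pictured with a small sketch. This is an illustration only, not the method of FFTNet, iLPCNet, or LPCNet: the function names, the categorical (softmax) output, and the constant `voiced_temperature=0.7` are all assumptions chosen for the example. A temperature below 1 sharpens the distribution, so random sampling picks the most likely value more often.

```python
import math


def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature); temperature < 1 sharpens the result."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def frame_distribution(logits, voiced, voiced_temperature=0.7):
    """Use a lower, constant temperature in voiced frames (hypothetical value)."""
    temperature = voiced_temperature if voiced else 1.0
    return apply_temperature(logits, temperature)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    print(frame_distribution(logits, voiced=False))
    print(frame_distribution(logits, voiced=True))
```

In the voiced case the largest probability increases and the tail shrinks, which reduces the sampling noise the statement refers to.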
“…And we integrate frequencies from 0 to 24kHz into 50-dimensional filter banks, finer than the original Bark scale. of 16 kHz and some subsequent research has produced a 24 kHz LPCNet with higher fidelity synthesis [44], [47], [53], [54]. As described in Section I, we propose a Full-band LPCNet by introducing the following simple but effective modifications to synthesize high-fidelity speech waveforms with a sampling frequency of 48 kHz , which can cover the entire speech waveform and human auditory frequency ranges, using a CPU.…”
Section: Input Features
confidence: 99%
“…The created cascading network-based created system attains higher results compared to traditional structures. (M. Hwang et al, 2020) [22] improving the LPCNet vocoder performance by applying the linear prediction structured mixture density networks. In this process, an autoregressive neural vocoder examines the vocal source and vocal tract components.…”
Section: Background Study
confidence: 99%