FeatherWave: An Efficient High-Fidelity Neural Vocoder with Multi-Band Linear Prediction

Tian, Qiao; Zhang, Zewang; Lu, Hanqing; Chen, Linghui; Liu, Shan

doi:10.21437/interspeech.2020-1156

Cited by 15 publications

(9 citation statements)

References 22 publications

“…Another technique that is widely used to speed up the inference of vocoders is subband modeling, which divides the waveform into multiple subbands for fast inference. Typical models include DurIAN [411], multi-band MelGAN [400], subband WaveNet [244], and multi-band LPCNet [342]. Bunched LPCNet [364] reduces the computation complexity of LPCNet with sample bunching and bit bunching, achieving more than 2x speedup.…”

Section: Adaptive (mentioning)

confidence: 99%

A Survey on Neural Speech Synthesis

Tan, Qin, Soong, et al. 2021

Preprint

Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry. As the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models, and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc. We further summarize resources related to TTS (e.g., datasets, opensource implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.

“…Consequently, Full-band LPCNet is the only neural vocoder that can realize real-time and high-fidelity speech synthesis with a sampling frequency of 48 kHz using a CPU. As future work, Full-band LPCNet can be made much faster by applying acceleration methods, such as the subband [20], [33], [50], [52], sample bunching [51] and tensor decomposition [48] methods. Additionally, Full-band LPCNet can be extended to multi-speaker neural vocoder to synthesize the speech waveforms of many and unspecified speakers that were not included in training [75].…”

Section: Subjective Evaluation (mentioning)

confidence: 99%

“…In singing voice synthesis, we found it necessary to adjust the batch length of the input features appropriately. Although we performed only a simple extension for fullband synthesis in this study, acceleration methods such as subband [20], [33], [50], [52], sample bunching [51], and tensor decomposition [48] methods can be directly applied to Full-band LPCNet to further improve the synthesis speed.…”

Section: Introduction (mentioning)

confidence: 99%

Full-Band LPCNet: A Real-Time Neural Vocoder for 48 kHz Audio With a CPU

et al. 2021

This paper investigates a real-time neural speech synthesis system on CPUs that can synthesize high-fidelity 48 kHz speech waveforms to cover the entire frequency range audible by human beings. Although most previous studies on 48 kHz speech synthesis have used traditional source-filter vocoders or a WaveNet vocoder for waveform generation, they have some drawbacks regarding synthesis quality or inference speed. LPCNet was proposed as a real-time neural vocoder with a mobile CPU but its sampling frequency is still only 16 kHz. In this paper, we propose a Full-band LPCNet to synthesize high-fidelity 48 kHz speech waveforms with a CPU by introducing some simple but effective modifications to the conventional LPCNet. We then evaluate the synthesis quality using both normal speech and a singing voice. The results of these experiments demonstrate that the proposed Full-band LPCNet is the only neural vocoder that can synthesize high-quality 48 kHz speech waveforms while maintaining real-time capability with a CPU.

“…In contrast to real-time autoregressive neural vocoders such as WaveRNN [4], LPCNet [5], and FeatherWave [6], non-autoregressive models, which simultaneously synthesize all speech waveform samples, can be easily implemented as real-time neural vocoders, and many models have been investigated. Non-autoregressive neural vocoders are broadly categorized into two types.…”

Section: Introduction (mentioning)

confidence: 99%

Noise Level Limited Sub-Modeling for Diffusion Probabilistic Vocoders

Okamoto, Toda, Shiga, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Although diffusion probabilistic vocoders WaveGrad and DiffWave can realize real-time high-fidelity speech synthesis with a simple loss function in training, all noise components with over the full range of noise levels are predicted by one model in all iterations. This paper proposes a simple but effective noise level-limited sub-modeling framework for diffusion probabilistic vocoders Sub-WaveGrad and Sub-DiffWave. In the proposed method, DiffWave conditioned on a continuous noise level like WaveGrad, and spectral enhancement post-filtering are also provided. The proposed Sub-WaveGrad and Sub-DiffWave models are realized using 10 sub-models. These models are separately trained with different noise level limits, and only necessary sub-models are used according to the noise schedule during inference. The results of experiments using a Japanese female speech corpus indicate that both the proposed Sub-WaveGrad and Sub-DiffWave outperform vanilla WaveGrad and DiffWave in terms of the model accuracy and synthesis quality while retaining the inference speed.
