Although end-to-end (E2E) text-to-speech (TTS) models with a HiFi-GAN-based neural vocoder (e.g., VITS and JETS) achieve human-like speech quality with fast inference, there is still room to improve their inference speed on a CPU for practical implementations because the HiFi-GAN-based neural vocoder unit is a bottleneck. Moreover, HiFi-GAN is widely used not only in TTS but also in many other speech and audio applications. To accelerate HiFi-GAN while maintaining synthesis quality, multi-stream (MS)-HiFi-GAN, iSTFTNet, and MS-iSTFT-HiFi-GAN have been proposed. Although iSTFTNet and MS-iSTFT-HiFi-GAN introduce inverse short-time Fourier transform (iSTFT)-based fast upsampling, we first find that the predicted intermediate features input to the iSTFT layer are completely different from the original STFT spectra owing to the redundancy of the overlap-add operation in the iSTFT. To further improve synthesis quality and inference speed, we propose FC-HiFi-GAN and MS-FC-HiFi-GAN, which replace the iSTFT layer with trainable fully-connected (FC) layer-based fast upsampling without the overlap-add operation. Experimental results under unseen-speaker synthesis and E2E TTS conditions show that, compared with iSTFTNet and MS-iSTFT-HiFi-GAN, the proposed methods slightly accelerate inference and significantly improve synthesis quality in JETS-based E2E TTS. Therefore, the iSTFT layer in HiFi-GAN-based neural vocoders can be replaced by the proposed trainable FC layer-based upsampling without the overlap-add operation.

INDEX TERMS End-to-end text-to-speech, fully-connected layer-based upsampling, iSTFTNet, multi-stream HiFi-GAN, neural vocoder
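To make the contrast in the abstract concrete, the following is a minimal NumPy sketch (our illustration, not the paper's implementation; all shapes and weight matrices are hypothetical placeholders for learned parameters) of the two upsampling paths: an iSTFT-style path, where each frame produces a full window of samples that are overlap-added with its neighbours, versus an FC-layer path, where a single trainable matrix maps each frame directly to hop_length non-overlapping samples so plain concatenation replaces overlap-add.

```python
import numpy as np

# Illustrative sketch only: contrasts iSTFT-style overlap-add upsampling
# with FC-layer upsampling that maps each frame to hop_length samples.
rng = np.random.default_rng(0)

n_frames, n_channels = 4, 8      # hypothetical intermediate feature shape
win_length, hop_length = 16, 8   # 50% frame overlap for the iSTFT-style case

features = rng.standard_normal((n_frames, n_channels))

# --- iSTFT-style path: each frame yields win_length samples, and
# neighbouring windows are summed (overlap-add), so adjacent frames
# redundantly contribute to the same output samples.
frame_to_win = rng.standard_normal((n_channels, win_length))
frames = features @ frame_to_win                  # (n_frames, win_length)
ola = np.zeros(hop_length * (n_frames - 1) + win_length)
for i, frame in enumerate(frames):
    ola[i * hop_length : i * hop_length + win_length] += frame

# --- FC-layer path: one matrix (trainable in practice) maps each frame
# to exactly hop_length samples; concatenation replaces overlap-add.
fc = rng.standard_normal((n_channels, hop_length))
waveform = (features @ fc).reshape(-1)            # (n_frames * hop_length,)

print(ola.shape)       # overlap-added output length
print(waveform.shape)  # FC-upsampled output length
```

Note how, in the overlap-add path, each output sample in the overlapped regions mixes contributions from two frames; this is the redundancy the abstract points to, and it is absent in the FC-layer path.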