Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

Kawamura, Masaya; Shirahata, Yuma; Yamamoto, Ryōichi; Tachibana, Kentaro

doi:10.1109/icassp49357.2023.10095296

Cited by 6 publications

(1 citation statement)

References 15 publications

“…These studies have been summarized to enhance acoustic models [16] (make acoustic representations from input text) and neural vocoders [17] (convert these representations to waveforms). However, different optimizations of the two models limit the execution of TTS systems [18]. Moreover, trade-offs exist for the computational cost, inference speech, and synthesized speech quality [19].…”

Section: Introduction (mentioning)

confidence: 99%

A Smart Control System for the Oil Industry Using Text-to-Speech Synthesis Based on IIoT

et al. 2023

Oil refineries have high operating expenses and are often exposed to increased asset integrity risks and functional failure. Real-time monitoring of their operations has always been critical to ensuring safety and efficiency. We propose a novel Industrial Internet of Things (IIoT) design that employs a neural-network-based text-to-speech (TTS) synthesizer to build an intelligent extension control system. We enhanced a TTS model to achieve high inference speed by employing the HiFi-GAN V3 vocoder with the FastSpeech 2 acoustic model. We tested our system on a low-resource embedded system in a real-time environment. Moreover, we customized the TTS model to generate speech for two target speakers (female and male) using a small dataset. We performed an ablation analysis to evaluate the performance of our design (IoT connectivity, memory usage, inference speed, and output speech quality). The results show that the real-time factor (RTF) of our system is 6.4 without the cache mechanism, a technique that retrieves previously synthesized sentences from system memory. With the cache mechanism, our proposed model runs on a low-resource computational device at real-time speed (RTF of 0.16, 0.19, and 0.29 when the memory holds 250, 500, and 1000 WAV files, respectively). Additionally, the cache mechanism reduced memory usage from 16.3% (for synthesizing a ten-second sentence) to 6.3%. Furthermore, according to the objective speech quality evaluation, our TTS model is superior to the baseline TTS model.
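To make the RTF figures and the cache mechanism in this abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual definition of RTF (synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time). The names `real_time_factor`, `CachedSynthesizer`, and `fake_tts`, and the 64-second timing, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Usual RTF definition (assumed here): processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


class CachedSynthesizer:
    """Hypothetical wrapper that reuses audio for sentences it has already synthesized."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        self._cache: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def speak(self, sentence: str) -> bytes:
        key = sentence.strip().lower()  # normalise so trivial variants share one entry
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        audio = self._synthesize(sentence)  # slow path: run the TTS model
        self._cache[key] = audio
        return audio


def fake_tts(sentence: str) -> bytes:
    """Stand-in for a real TTS model; returns placeholder bytes, not audio."""
    return sentence.encode("utf-8")


# Illustrative numbers only: 64 s of compute for 10 s of audio gives RTF 6.4.
rtf_without_cache = real_time_factor(64.0, 10.0)

tts = CachedSynthesizer(fake_tts)
tts.speak("Pressure alarm in unit three")
tts.speak("pressure alarm in unit three ")  # served from the cache
```

A cache of this kind helps only when the same sentences recur, as fixed alarm or status messages would in a control system; a new sentence still pays the full synthesis cost.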

Fast Neural Speech Waveform Generative Models With Fully-Connected Layer-Based Upsampling

Yamashita, Okamoto, Takashima et al. 2024

IEEE Access

Although end-to-end (E2E) text-to-speech (TTS) models with a HiFi-GAN-based neural vocoder (e.g., VITS and JETS) can achieve human-like speech quality with fast inference, there is still room to improve their inference speed on a CPU for practical implementations, because the HiFi-GAN-based neural vocoder unit is a bottleneck. Additionally, HiFi-GAN is widely used not only for TTS but also for many other speech and audio applications. To accelerate HiFi-GAN while maintaining synthesis quality, Multi-stream (MS)-HiFi-GAN, iSTFTNet, and MS-iSTFT-HiFi-GAN have been proposed. Although inverse short-time Fourier transform (iSTFT)-based fast upsampling is introduced in iSTFTNet and MS-iSTFT-HiFi-GAN, we first find that the predicted intermediate features input to the iSTFT layer are completely different from the original STFT spectra due to the redundancy of the overlap-add operation in iSTFT. To further improve synthesis quality and inference speed, we propose FC-HiFi-GAN and MS-FC-HiFi-GAN, which replace the iSTFT layer with trainable fully-connected (FC) layer-based fast upsampling without an overlap-add operation. The experimental results for unseen speaker synthesis and E2E TTS conditions show that, compared with iSTFTNet and MS-iSTFT-HiFi-GAN, the proposed methods slightly accelerate inference and significantly improve synthesis quality in JETS-based E2E TTS. Therefore, the iSTFT layer can be replaced by the proposed trainable FC layer-based upsampling without an overlap-add operation in HiFi-GAN-based neural vocoders.

INDEX TERMS: End-to-end text-to-speech, fully-connected layer-based upsampling, iSTFTNet, Multi-stream HiFi-GAN, neural vocoder
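The redundancy of overlap-add mentioned in this abstract can be illustrated with a toy example. This is a rough sketch under simplifying assumptions, not the method from the paper: it omits the inverse FFT and the synthesis window, and the function names `overlap_add` and `fc_upsample`, the frame values, and the weight shapes are all hypothetical. It shows that two different sets of frames can overlap-add to the same signal, and what a per-frame linear mapping without overlap looks like.

```python
from typing import List, Sequence


def overlap_add(frames: Sequence[Sequence[float]], hop: int) -> List[float]:
    """Sum frames that start every `hop` samples; overlapping regions are added."""
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        for j, value in enumerate(frame):
            out[i * hop + j] += value
    return out


def fc_upsample(
    features: Sequence[Sequence[float]],
    weight: Sequence[Sequence[float]],  # shape: feature_dim x hop
    bias: Sequence[float],              # shape: hop
) -> List[float]:
    """Map each feature frame to `hop` samples with a linear layer, then concatenate."""
    out: List[float] = []
    for frame in features:
        for k in range(len(bias)):
            out.append(sum(x * weight[d][k] for d, x in enumerate(frame)) + bias[k])
    return out


# Two different frame sets (frame length 4, hop 2) ...
frames_a = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
frames_b = [[1.0, 1.0, 2.0, 2.0], [0.0, 0.0, 1.0, 1.0]]

# ... that produce the same output after overlap-add.
signal_a = overlap_add(frames_a, hop=2)
signal_b = overlap_add(frames_b, hop=2)

# Without overlap, each frame owns its own block of output samples.
identity = [[1.0, 0.0], [0.0, 1.0]]
blocks = fc_upsample([[1.0, 2.0], [3.0, 4.0]], identity, [0.0, 0.0])
```

Because overlap-add is many-to-one, a loss on the output waveform alone does not pin down the intermediate frames, which is consistent with the abstract's observation that the predicted features differ from true STFT spectra. In the non-overlapping variant every output sample depends on exactly one frame.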

QUICKVC: A Lightweight VITS-Based Any-to-Many Voice Conversion Model using ISTFT for Faster Conversion

Guo, Liu, Ishi et al. 2023

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
