Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation

Chevi, Rendi; Prasojo, Radityo Eko; Aji, Alham Fikri

doi:10.48550/arxiv.2203.15643

Cited by 2 publications

(2 citation statements)

References 0 publications

“…Light-Speech [24] uses neural architecture search to achieve 15X model compression, resulting in the final model with 1.8M parameters. Nix-TTS [25] builds an end-to-end TTS system with 5.23M using knowledge distillation. These previous works built models on a single-speaker dataset of LJSpeech.…”

Section: B Track 2: Lightweight TTS (mentioning)

confidence: 99%

Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech

Singh, Nagireddi, Jayakumar et al. 2024

IEEE Open J. Signal Process.


“…Recent attempts to build on-device neural TTS include On-device TTS [7], LiteTTS [8], PortaSpeech [9], LightSpeech [10] and Nix-TTS [11]. On-device TTS is slow and resource intensive since it is a modified Tacotron2 for mel spectrogram generation and uses WaveRNN for vocoder.…”

Section: Introduction (mentioning)

confidence: 99%

EfficientSpeech: An On-Device Text to Speech Model

Atienza

2023

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


State-of-the-art (SOTA) neural text-to-speech (TTS) models can generate natural-sounding synthetic voices. These models are characterized by large memory footprints and a substantial number of operations, owing to a long-standing focus on speech quality with cloud inference in mind. Neural TTS models are generally not designed to perform standalone speech synthesis on resource-constrained edge devices without Internet access. In this work, an efficient neural TTS model called EfficientSpeech is proposed that synthesizes speech on an ARM CPU in real time. EfficientSpeech uses a shallow non-autoregressive pyramid-structure transformer forming a U-Network. EfficientSpeech has 266k parameters and consumes only 90 MFLOPS, or about 1% of the size and amount of computation of modern compact models such as Mixer-TTS. EfficientSpeech achieves an average mel generation real-time factor of 104.3 on an RPi4. Human evaluation shows only a slight degradation in audio quality compared to FastSpeech2.
