Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis

Lei, Shun; Zhou, Yixuan; Chen, Liyang; Wu, Zhiyong; Kang, Shiyin; Meng, Helen

doi:10.1109/icassp49357.2023.10095866

Search citation statements

Order By: Relevance

Paper Sections

Select...

Citation Types

Supporting

Mentioning

Contrasting

Year Published

2023

2024

Publication Types

Select...

Other1

Article1

Relationship

Self Cite0

Independent2

Authors

Journals

Cited by 2 publications

References 14 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS

Guo,

Zhu,

Xue

et al. 2023

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

View full text Add to dashboard Cite

HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS

Guo,

Zhu,

Xue

et al. 2023

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

View full text Add to dashboard Cite

Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

Barakat,

Turk,

Demiroglu

2024

J AUDIO SPEECH MUSIC PROC.

View full text Add to dashboard Cite

Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In the Section 8, we pinpoint some research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this hot research area to offer guidance to interested researchers and future endeavors in this field.

show abstract

Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis

Cited by 2 publications

References 14 publications

HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS

HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS

Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

Contact Info

Product

Resources

About