Polyphone Disambiguation and Accent Prediction Using Pre-Trained Language Models in Japanese TTS Front-End

Hida, Rem; Hamada, Masaki; Kamada, Chie; Tsunoo, Emiru; Sekiya, Toshiyuki; Kumakura, Toshiyuki

doi:10.1109/icassp43922.2022.9746212

Cited by 3 publications

(1 citation statement)

References 17 publications

“…(1) Parallel corpus of different accents of the same speaker using source and target speech content and time alignment (Finkelstein et al., 2022; Liu et al., 2022; Hida et al., 2022; Toda et al., 2007; Oyamada et al., 2017). (2) Non-parallel corpus of multiple speakers with multiple accents using inconsistent source and target speech content (Wang et al., 2021; Zhao et al., 2018, 2019; Kaneko et al., 2019, 2020a, 2021). Finkelstein et al. (2022) used a multi-stage trained TTS model to achieve transfer of North American accents, Australian accents, and British accents, and used a CHiVE-BERT pre-training model to enhance the audio effect of accent generation.…”

Section: Introduction (mentioning)

confidence: 99%

Non-parallel Accent Transfer based on Fine-grained Controllable Accent Modelling

Wang, Yu, Yang, et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023

Existing accent transfer works rely on parallel data or speech recognition models. This paper focuses on the practical application of accent transfer and aims to implement accent transfer using non-parallel datasets. The study has encountered the challenge of speech representation disentanglement and modeling accents. In our accent modeling transfer framework, we manage to solve these problems by two proposed methods. First, we learn the suprasegmental information associated with tone to finely model the accents in terms of tone and rhythm. Second, we propose to use mutual information learning to disentangle the accent features and control the accent of the generated speech during the inference time. Experiments show that the proposed framework attains superior performance to the baseline models in terms of accentedness and audio quality.
