Liumeng Xue scite author profile

Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass the traditional multi-stage hand-engineered systems, with both simplified system building pipelines and high-quality speech. With a unique encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text analysis front-end, duration model, acoustic model, and audio synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model bridging the text sequence and the acoustic sequence. However, attention learning suffers from low training efficiency and model instability problems, which hinder the E2E approaches from wide deployment. In this paper, we address the problems and propose a novel pre-alignment guided attention learning approach. Specifically, we inject handy prior knowledge, namely accurate phoneme durations, into the neural network loss function to bias the attention learning to the desired direction more accurately. The explicit time alignment between an audio recording and its corresponding phoneme sequence can be achieved by forced-alignment from an automatic speech recognizer (ASR). The experiments show that the proposed pre-alignment guided (PAG) attention approach can significantly improve training efficiency and model stability. More specifically, the PAG-updated version of Tacotron2 can quickly obtain the attention alignment using only 500 (text, audio) pairs, which is apparently not possible for the original Tacotron2. A series of subjective experiments also show that the PAG-Tacotron2 approach can synthesize more stable and natural speech.
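The idea of biasing attention with phoneme durations can be illustrated with a small sketch. It is not the paper's implementation: the function names `durations_to_alignment` and `guided_attention_loss` are hypothetical, and the mean absolute difference used as the penalty is an assumption, since the abstract does not state the exact form of the loss term. Durations are assumed to be given as frame counts per phoneme, as a forced aligner might produce.

```python
from typing import List


def durations_to_alignment(durations: List[int]) -> List[List[float]]:
    """Expand per-phoneme frame counts into a hard (frames x phonemes) alignment.

    Row t has a 1.0 in the column of the phoneme that frame t belongs to.
    """
    n_phonemes = len(durations)
    target = []
    for phoneme_idx, n_frames in enumerate(durations):
        for _ in range(n_frames):
            row = [0.0] * n_phonemes
            row[phoneme_idx] = 1.0
            target.append(row)
    return target


def guided_attention_loss(attention: List[List[float]],
                          target: List[List[float]]) -> float:
    """Mean absolute difference between attention weights and the pre-alignment.

    Assumed penalty form (hypothetical); it would be added to the usual
    spectrogram reconstruction loss with some weight.
    """
    if len(attention) != len(target):
        raise ValueError("attention and target must have the same number of frames")
    total = 0.0
    count = 0
    for att_row, tgt_row in zip(attention, target):
        if len(att_row) != len(tgt_row):
            raise ValueError("attention and target must have the same number of phonemes")
        for a, t in zip(att_row, tgt_row):
            total += abs(a - t)
            count += 1
    return total / count


# Three phonemes lasting 2, 1 and 3 frames (illustrative values only).
durations = [2, 1, 3]
target = durations_to_alignment(durations)

# An untrained model might attend uniformly over all phonemes at every frame.
uniform = [[1.0 / 3.0] * 3 for _ in range(len(target))]

loss_perfect = guided_attention_loss(target, target)
loss_uniform = guided_attention_loss(uniform, target)
```

In this toy case the loss is zero when the attention matches the pre-alignment exactly and grows as the attention spreads away from it, which is the direction of bias the abstract describes.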


On the localness modeling for the self-attention based end-to-end speech synthesis

Yang

Kang

et al. 2020

Neural Networks


Building a Mixed-Lingual Neural TTS System with Only Monolingual Data

Xue

Song

Xu

et al. 2019


When deploying a Chinese neural Text-to-Speech (TTS) system, one of the challenges is to synthesize Chinese utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance and naturalness. We start the investigation with an average voice model which is built from multispeaker monolingual data, i.e., Mandarin and English data. On the basis of that, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and study the choice of data for model training. We report the findings and discuss the challenges to build a mixed-lingual TTS system with only monolingual data.


Building a controllable expressive speech synthesis system with multiple emotion strengths

Zhu

Xue

2020

Cognitive Systems Research




Liumeng Xue

Controllable Emotion Transfer For End-to-End Speech Synthesis

Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis

