Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis

Zhang, Yajie; Pan, Shifeng; He, Lei; Ling, Zhen-Hua

doi:10.1109/icassp.2019.8683623

Cited by 212 publications

(159 citation statements)

References 10 publications

Supporting

Mentioning

155

Contrasting

Unclassified


“…In the case of speech processing, an ideal disentangled representation would be able to separate fine-grained factors such as speaker identity, noise, recording channels, and prosody [22], as well as the linguistic content. Thus, disentanglement will allow learning of salient and robust representations from the speech that are essential for applications including speech recognition [64], prosody transfer [77,87], speaker verification [66], speech synthesis [31,77], and voice conversion [32], among other applications.…”

Section: Learning Disentangled Representation (mentioning)

confidence: 99%

Privacy-preserving Voice Analysis via Disentangled Representations

Aloufi, Haddadi, Boyle (2020)

Proceedings of the 2020 ACM SIGSAC Conference on Cloud Computing Security Workshop


Voice User Interfaces (VUIs) are increasingly popular and built into smartphones, home assistants, and Internet of Things (IoT) devices. Despite offering an always-on convenient user experience, VUIs raise new security and privacy concerns for their users. In this paper, we focus on attribute inference attacks in the speech domain, demonstrating the potential for an attacker to accurately infer a target user's sensitive and private attributes (e.g. their emotion, sex, or health status) from deep acoustic models. To defend against this class of attacks, we design, implement, and evaluate a user-configurable, privacy-aware framework for optimizing speech-related data sharing mechanisms. Our objective is to enable primary tasks such as speech recognition and user identification, while removing sensitive attributes in the raw speech data before sharing it with a cloud service provider. We leverage disentangled representation learning to explicitly learn independent factors in the raw data. Based on a user's preferences, a supervision signal informs the filtering out of invariant factors while retaining the factors reflected in the selected preference. Our experimental evaluation over five datasets shows that the proposed framework can effectively defend against attribute inference attacks by reducing their success rates to approximately that of guessing at random, while maintaining accuracy in excess of 99% for the tasks of interest. We conclude that negotiable privacy settings enabled by disentangled representations can bring new opportunities for privacy-preserving applications.


“…Some researchers make some progress to use a reference encoder to capture prosody information from audios by several feature learning techniques [4][5] [6][7] [8]. The above models can transfer the prosody from reference audio to the audios to be synthesised.…”

Section: Graph Auxiliary Encoder (mentioning)

confidence: 99%

GraphTTS: Graph-to-Sequence Modelling in Neural Text-to-Speech

Sun, Wang, Cheng, et al. (2020)

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


This paper leverages the graph-to-sequence method in neural text-to-speech (GraphTTS), which maps the graph embedding of the input sequence to spectrograms. The graphical inputs consist of node and edge representations constructed from input texts. The encoding of these graphical inputs incorporates syntax information by a GNN encoder module. Besides, applying the encoder of GraphTTS as a graph auxiliary encoder (GAE) can analyse prosody information from the semantic structure of texts. This can remove the manual selection of reference audios process and makes prosody modelling an end-to-end procedure. Experimental analysis shows that GraphTTS outperforms the state-of-the-art sequence-to-sequence models by 0.24 in Mean Opinion Score (MOS). GAE can adjust the pause, ventilation and tones of synthesised audios automatically. This experimental conclusion may give some inspiration to researchers working on improving speech synthesis prosody.


“…For example, some researchers implemented open clones of Tacotron [66][67][68] to reproduce the speech of satisfactory quality as clear as the original work [69]. The authors in [70] introduced deep generative models, such as Variational Auto-encoder (VAE) [71], to Tacotron to explicitly model the latent representation of a speaker state in a continuous space, and additionally to control the speaking style in speech synthesis [70].…”

Section: Speech Synthesis Based On Tacotron (mentioning)

confidence: 99%

A Review of Deep Learning Based Speech Synthesis

Ning et al. (2019)

Applied Sciences

117


Speech synthesis, also known as text-to-speech (TTS), has attracted increasingly more attention. Recent advances on speech synthesis are overwhelmingly contributed by deep learning or even end-to-end techniques which have been utilized to enhance a wide range of application scenarios such as intelligent speech interaction, chatbot or conversational artificial intelligence (AI). For speech synthesis, deep learning based techniques can leverage a large scale of <text, speech> pairs to learn effective feature representations to bridge the gap between text and speech, thus better characterizing the properties of events. To better understand the research dynamics in the speech synthesis field, this paper firstly introduces the traditional speech synthesis methods and highlights the importance of the acoustic modeling from the composition of the statistical parametric speech synthesis (SPSS) system. It then gives an overview of the advances on deep learning based speech synthesis, including the end-to-end approaches which have achieved state-of-the-art performance in recent years. Finally, it discusses the problems of the deep learning methods for speech synthesis, and also points out some appealing research directions that can bring the speech synthesis research into a new frontier.
