Alexandra Torresquintero scite author profile

Alexandra Torresquintero

5 Publications

34 Citation Statements Received

62 Citation Statements Given

Publications

Phonological Features for 0-Shot Multilingual Speech Synthesis

Staib, Teh, Torresquintero et al. 2020

Code-switching, the intra-utterance use of multiple languages, is prevalent across the world. Within text-to-speech (TTS), multilingual models have been found to enable code-switching [1][2][3]. By modifying the linguistic input to sequence-to-sequence TTS, we show that code-switching is possible for languages unseen during training, even within monolingual models. We use a small set of phonological features derived from the International Phonetic Alphabet (IPA), such as vowel height and frontness, and consonant place and manner. This allows the model topology to stay unchanged for different languages, and enables new, previously unseen feature combinations to be interpreted by the model. We show that this allows us to generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
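The idea of encoding phones as phonological features can be illustrated with a minimal sketch. The feature axes, their values, the phone table and the helper names (`encode`, `is_covered`) below are illustrative assumptions, not the feature set or code used in the paper. The sketch shows how a phone absent from training can still be expressed as a new combination of feature values that were each seen in training.

```python
from typing import Dict, List

# Hypothetical feature inventory (an assumption for illustration only;
# the paper's actual feature set is not listed in the abstract).
AXES: Dict[str, List[str]] = {
    "height": ["close", "mid", "open"],
    "frontness": ["front", "central", "back"],
    "place": ["bilabial", "alveolar", "velar"],
    "manner": ["plosive", "nasal", "fricative"],
}

# One input dimension per (axis, value) pair, in a fixed order.
DIMENSIONS: List[str] = [
    f"{axis}={value}" for axis, values in AXES.items() for value in values
]

# Small illustrative phone table keyed by IPA symbol.
PHONES: Dict[str, Dict[str, str]] = {
    "i": {"height": "close", "frontness": "front"},
    "u": {"height": "close", "frontness": "back"},
    "a": {"height": "open", "frontness": "front"},
    "p": {"place": "bilabial", "manner": "plosive"},
    "k": {"place": "velar", "manner": "plosive"},
    "n": {"place": "alveolar", "manner": "nasal"},
    "s": {"place": "alveolar", "manner": "fricative"},
    "x": {"place": "velar", "manner": "fricative"},
}


def encode(phone: str) -> List[int]:
    """Return a multi-hot vector over DIMENSIONS for one phone."""
    active = {f"{axis}={value}" for axis, value in PHONES[phone].items()}
    return [1 if dim in active else 0 for dim in DIMENSIONS]


def is_covered(phone: str, training_phones: List[str]) -> bool:
    """True if every feature value of `phone` occurs in some training phone."""
    seen = {item for p in training_phones for item in PHONES[p].items()}
    return all(item in seen for item in PHONES[phone].items())


if __name__ == "__main__":
    # "x" is not in the training inventory, but "velar" comes from "k"
    # and "fricative" comes from "s", so its vector uses only seen values.
    print(encode("x"))
    print(is_covered("x", ["k", "s"]))
```

Because the input size depends only on the feature inventory and not on the phone set, the same input layer can in principle serve a language whose phones were not in the training data, which is the property the abstract relies on.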

Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis

Mohan, Hu, Teh et al. 2021

Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct renditions of a text to be produced. Since much of the unexplained variation is in the prosody, we propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody: F0, energy and duration. The model is flexible about how the values of these features are specified: they can be externally provided, or predicted from text, or predicted then subsequently modified. Compared to a model that employs a variational autoencoder to learn unsupervised latent features, our model provides more interpretable, temporally-precise, and disentangled control. When automatically predicting the acoustic features from text, it generates speech that is more natural than that from a Tacotron 2 model with reference encoder. Subsequent human-in-the-loop modification of the predicted acoustic features can further increase naturalness significantly.
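The "predicted then subsequently modified" workflow can be sketched as follows. This is only an illustration under stated assumptions: the per-phone granularity, the `PhoneProsody` record, the units, and the `scale_f0` helper are hypothetical and are not taken from the paper. The sketch edits F0 over a chosen span of phones while leaving energy, duration and all other phones untouched.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class PhoneProsody:
    """Hypothetical per-phone prosody values (assumed granularity and units)."""
    phone: str
    f0_hz: float
    energy: float
    duration_ms: float


def scale_f0(
    prosody: List[PhoneProsody], start: int, end: int, factor: float
) -> List[PhoneProsody]:
    """Return a copy with F0 multiplied by `factor` for indices start <= i < end."""
    return [
        replace(p, f0_hz=p.f0_hz * factor) if start <= i < end else p
        for i, p in enumerate(prosody)
    ]


if __name__ == "__main__":
    # Stand-in for values a model might predict from text (made-up numbers).
    predicted = [
        PhoneProsody("h", 110.0, 0.4, 60.0),
        PhoneProsody("a", 120.0, 0.8, 90.0),
        PhoneProsody("i", 115.0, 0.7, 80.0),
    ]
    # Raise pitch on the second phone only; the edited values would then be
    # passed to the synthesis model in place of the predictions.
    edited = scale_f0(predicted, start=1, end=2, factor=1.5)
    for before, after in zip(predicted, edited):
        print(before.phone, before.f0_hz, after.f0_hz)
```

Keeping the three features as separate, named fields mirrors the abstract's point about disentangled control: changing one feature over one time span does not alter the others.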

Incremental Text to Speech for Neural Sequence-to-Sequence Models Using Reinforcement Learning

Mohan, Lenain, Foglianti et al. 2020

ADEPT: A Dataset for Evaluating Prosody Transfer

Torresquintero, Teh, Wallis et al. 2021

Text-to-speech is now able to achieve near-human naturalness, and research focus has shifted to increasing expressivity. One popular method is to transfer the prosody from a reference speech sample. There have been considerable advances in using prosody transfer to generate more expressive speech, but the field lacks a clear definition of what successful prosody transfer means and a method for measuring it. We introduce a dataset of prosodically-varied reference natural speech samples for evaluating prosody transfer. The samples include global variations reflecting emotion and interpersonal attitude, and local variations reflecting topical emphasis, propositional attitude, syntactic phrasing and marked tonicity. The corpus only includes prosodic variations that listeners are able to distinguish with reasonable accuracy, and we report these figures as a benchmark against which text-to-speech prosody transfer can be compared. We conclude the paper with a demonstration of our proposed evaluation methodology, using the corpus to evaluate two text-to-speech models that perform prosody transfer.

Ensemble Prosody Prediction For Expressive Speech Synthesis

Teh, Hu, Mohan et al. 2023
