This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario, provided the recording conditions are studio-quality. When the recordings show significant changes in quality, or when moving towards non-speech vocalizations or singing, the vocoder still significantly outperforms speaker-dependent vocoders, but operates at a lower average relative MUSHRA of 75%. These results are shown to be consistent across languages, regardless of whether they were seen during training (e.g. English or Japanese) or unseen (e.g. Wolof, Swahili, Amharic).
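The abstract reports quality as a relative mean MUSHRA score, i.e. the mean listener rating of the vocoded system expressed as a percentage of the mean rating of natural speech. A minimal sketch of that computation follows; the rating arrays are illustrative placeholders, not the paper's data:

```python
import numpy as np

def relative_mushra(vocoded_scores, natural_scores):
    """Relative mean MUSHRA: mean rating of the vocoded system
    expressed as a percentage of the mean rating of natural speech.
    Both arguments are arrays of per-utterance listener ratings (0-100).
    """
    return 100.0 * np.mean(vocoded_scores) / np.mean(natural_scores)

# Illustrative values only, not the paper's data:
vocoded = np.array([72.0, 80.5, 76.0, 78.5])
natural = np.array([75.0, 81.0, 78.0, 80.0])
print(f"relative MUSHRA: {relative_mushra(vocoded, natural):.1f}%")
```

Under this reading, "98% relative mean MUSHRA" means the universal vocoder's ratings were, on average, 98% of those given to the natural reference recordings.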
Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work [1], a three-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech and by 16.3% for speaker similarity. Further, we match the naturalness and speaker similarity of a Tacotron2-based full-data (≈ 10 hours) model using only 15 minutes of target speaker data, whereas with 30 minutes or more, we significantly outperform it. The following improvements are proposed: 1) replacing the autoregressive, attention-based TTS model with a non-autoregressive model that uses an external duration model in place of attention, and 2) an additional fine-tuning step based on a Conditional Generative Adversarial Network (cGAN).
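The key architectural change in improvement 1) is that alignment between phonemes and acoustic frames comes from predicted durations rather than from learned attention. A minimal sketch of that duration-based upsampling (often called length regulation) is shown below, assuming integer per-phoneme frame counts; the function name and sizes are illustrative, not from the paper:

```python
import torch

def length_regulate(phoneme_encodings, durations):
    """Expand each phoneme encoding by its predicted duration in frames,
    replacing attention-based alignment with explicit upsampling.

    phoneme_encodings: (num_phonemes, hidden_dim) tensor
    durations:         (num_phonemes,) integer frame counts
    returns:           (total_frames, hidden_dim) tensor
    """
    return torch.repeat_interleave(phoneme_encodings, durations, dim=0)

# Illustrative only: 3 phonemes, hidden size 4, durations of 2/1/3 frames
enc = torch.randn(3, 4)
dur = torch.tensor([2, 1, 3])
frames = length_regulate(enc, dur)
print(frames.shape)  # torch.Size([6, 4])
```

Because every output frame is produced in one pass rather than step by step, this design avoids the attention failures and exposure bias of autoregressive decoding, which is consistent with the naturalness gains the abstract reports.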
We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called an Audio Encoder. Our universal vocoder offers real-time, high-quality speech synthesis on a wide range of use cases. We tested it on 43 internal speakers of diverse age and gender, speaking 20 languages in 17 unique styles, of which 7 voices and 5 styles were not seen during training. We show that the proposed universal vocoder significantly outperforms speaker-dependent vocoders overall. We also show that the proposed vocoder outperforms several existing neural vocoder architectures in terms of naturalness and universality. These findings are consistent when we further test on more than 300 open-source voices.
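The abstract does not specify the Audio Encoder's internals. One plausible form of such a conditioning network is a small convolutional stack that maps mel-spectrogram frames to features upsampled to the waveform sample rate, which the Parallel WaveNet student then consumes. The sketch below is an assumption for illustration only; the layer sizes, hop length, and class name are not the paper's values:

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Illustrative conditioning network: maps mel-spectrogram frames to
    conditioning features upsampled to the waveform sample rate, as one
    plausible form of the Audio Encoder described in the abstract.
    Layer sizes and the hop length are assumptions, not the paper's values.
    """
    def __init__(self, n_mels=80, channels=128, hop_length=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Upsample frame-rate features to sample rate for the vocoder.
        self.upsample = nn.Upsample(scale_factor=hop_length, mode="nearest")

    def forward(self, mel):  # mel: (batch, n_mels, frames)
        return self.upsample(self.conv(mel))  # (batch, channels, frames*hop)

mel = torch.randn(1, 80, 10)
cond = AudioEncoder()(mel)
print(cond.shape)  # torch.Size([1, 128, 2560])
```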