Zvi Kons scite author profile

Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time-consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We performed controlled experiments with varying amounts of speech and text training data. When only a tenth of the original data is available, intent classification accuracy degrades by 7.6% absolute. Assuming we have additional text-to-intent data (without speech) available, we investigated two techniques to improve the S2I system: (1) transfer learning, in which acoustic embeddings for intent classification are tied to fine-tuned BERT text embeddings; and (2) data augmentation, in which the text-to-intent data is converted into speech-to-intent data using a multi-speaker text-to-speech system. The proposed approaches recover 80% of the performance lost due to using limited intent-labeled speech.
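The two figures quoted in the abstract above can be combined into a rough estimate of the remaining accuracy gap. The sketch below is only illustrative arithmetic: it assumes the 80% recovery applies to the same 7.6-point absolute drop, which the abstract suggests but does not state explicitly, and the function name `remaining_gap` is hypothetical.

```python
def remaining_gap(drop_abs, recovered_fraction):
    """Return (points recovered, points still lost) for an absolute accuracy drop.

    drop_abs: accuracy lost, in absolute percentage points (7.6 in the abstract).
    recovered_fraction: share of that loss recovered (0.80 in the abstract).
    Assumption: the recovery is measured against the same absolute drop.
    """
    recovered = drop_abs * recovered_fraction
    return recovered, drop_abs - recovered


recovered, still_lost = remaining_gap(7.6, 0.80)
print(f"recovered {recovered:.2f} points, {still_lost:.2f} points still lost")
```

Under that assumption, roughly 6.1 of the 7.6 lost points come back, leaving a gap of about 1.5 points to the full-data system.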


High Quality, Lightweight and Adaptable TTS Using LPCNet

Kons, Shechtman, Sorin, et al. 2019


We present a lightweight, adaptable neural TTS system with high-quality output. The system is composed of three separate neural network blocks: prosody prediction, acoustic feature prediction, and Linear Prediction Coding Net (LPCNet) as a neural vocoder. This system can synthesize speech with close to natural quality while running 3 times faster than real time on a standard CPU. The modular setup of the system allows for simple adaptation to new voices with a small amount of data. We first demonstrate the ability of the system to produce high-quality speech when trained on large, high-quality datasets. Following that, we demonstrate its adaptability by mimicking unseen voices using datasets 5 to 20 minutes long with lower recording quality. Large-scale Mean Opinion Score quality and similarity tests are presented, showing that the system can adapt to unseen voices with a quality gap of 0.12 and a similarity gap of 3% compared to natural speech for male voices, and a quality gap of 0.35 and a similarity gap of 9% for female voices.


End-to-End Spoken Language Understanding Without Full Transcripts

Kuo, Tüske, Thomas, et al. 2020


Neural TTS Voice Conversion

Kons, Shechtman, Sorin, et al. 2018




Zvi Kons

An autonomous debating system

Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems

