Ultrasound-Based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis

Csapó, Tamás Gábor; Zainkó, Csaba; Tóth, László; Gosztolya, Gábor; Markó, Alexandra

doi:10.21437/interspeech.2020-1031

Cited by 24 publications

(42 citation statements)

References 27 publications

“…In [33], word recognition rates for a set of simple commands in automated listening tests were about 60%. Finally, the generative vocoders Wavenet [15] and WaveGlow [36], have been used, respectively, in [16] (2018), to produce single-word speech outputs from ECoG brain implant waveforms; and in [19], to synthesize a set of test sentences from ultrasound images of the tongue.…”

Section: Related Work (mentioning)

confidence: 99%

“…In these last examples [33] [16] [19], as well as in preliminary tests of our own using Griffin-Lim [37], the spectrograms predicted from sensor data, although globally correct, lack detailed harmonic structure, giving rise to speech that, while interesting, has only moderate intelligibility. Apparently, the new vocoders, however powerful, cannot compensate for shortcomings encountered in the acoustic parameter prediction phase.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, so-called "neural" vocoders have begun to appear as an alternative to source-filter synthesizers, sometimes involving the use of the Generative Adversarial Networks (GAN) [15] [16] [17] that are now widely used in generation tasks [18]. Applications of neural vocoders to multimodal speech synthesis have begun to appear [16] [19]; however, results to date, although interesting, remain preliminary.…”

Section: Introduction (mentioning)

confidence: 99%

Creating Song From Lip and Tongue Videos With a Convolutional Vocoder

2021

“…For the articulatory-to-acoustic conversion task, typically electromagnetic articulography [25], ultrasound tongue imaging [6,7], permanent magnetic articulography [11], surface electromyography [13], magnetic resonance imaging [5] or video of the lip movements [16,10,3,19,17,23,22,24] are used. Lip-to-speech synthesis can be solved in two different ways: 1) direct approach, meaning that speech is generated without an intermediate step from the input signal [16,10,3,19,17]; and 2) indirect approach, meaning that lip-to-text recognition is followed by text-to-speech synthesis [23,22,24].…”

Section: Lip-to-speech Conversion (mentioning)

confidence: 99%

Towards a practical lip-to-speech conversion system using deep neural networks and mobile application frontend

Arthur, Csapó

2021

Preprint

Self Cite

Articulatory-to-acoustic (forward) mapping is a technique to predict speech using various articulatory acquisition techniques as input (e.g. ultrasound tongue imaging, MRI, lip video). The advantage of lip video is that it is easily available and affordable: most modern smartphones have a front camera. There are already a few solutions for lip-to-speech synthesis, but they mostly concentrate on offline training and inference. In this paper, we propose a system built from a backend for deep neural network training and inference and a frontend in the form of a mobile application. Our initial evaluation shows that the scenario is feasible: a top-5 classification accuracy of 74% is combined with feedback from the mobile application user, so that speech-impaired users might be able to communicate with this solution.
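For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of how a top-k classification accuracy is commonly computed. It is not the authors' evaluation code: the function name `top_k_accuracy`, the toy scores, and the six-class setup are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 5
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Hypothetical toy data: 3 samples, 6 word classes (not from the paper).
scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.50],
]
labels = [3, 5, 5]

print(f"top-1: {top_k_accuracy(scores, labels, k=1):.3f}")
print(f"top-5: {top_k_accuracy(scores, labels, k=5):.3f}")
```

A top-5 score counts a prediction as correct whenever the true class appears anywhere in the five best candidates, which fits a setting where the user picks the intended word from a short list offered by the application.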

“…For AAM, one potential biosignal is ultrasound tongue imaging [6,7,8,9]. For the articulatory-to-acoustic conversion, typically, traditional [8] or neural vocoders [9] are used, which synthesize speech from the spectral parameters predicted by the DNNs from the articulatory input.…”

Section: Articulatory-to-acoustic Mapping (mentioning)

confidence: 99%

Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

Csapó, Tóth, Gosztolya et al.

2021

Preprint

Self Cite

Articulatory information has been shown to be effective in improving the performance of HMM-based and DNN-based text-to-speech synthesis. Speech synthesis research focuses traditionally on text-to-speech conversion, when the input is text or an estimated linguistic representation, and the target is synthesized speech. However, a research field that has risen in the last decade is articulation-to-speech synthesis (with a target application of a Silent Speech Interface, SSI), when the goal is to synthesize speech from some representation of the movement of the articulatory organs. In this paper, we extend traditional (vocoder-based) DNN-TTS with articulatory input, estimated from ultrasound tongue images. We compare text-only, ultrasound-only, and combined inputs. Using data from eight speakers, we show that the combined text and articulatory input can have advantages in limited-data scenarios, namely, it may increase the naturalness of synthesized speech compared to single text input. Besides, we analyze the ultrasound tongue recordings of several speakers, and show that misalignments in the ultrasound transducer positioning can have a negative effect on the final synthesis performance.
