Interspeech 2020
DOI: 10.21437/interspeech.2020-1031

Ultrasound-Based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis

Abstract: For articulatory-to-acoustic mapping using deep neural networks, typically spectral and excitation parameters of vocoders have been used as the training targets. However, vocoding often results in buzzy and muffled final speech quality. Therefore, in this paper on ultrasound-based articulatory-to-acoustic conversion, we use a flow-based neural vocoder (WaveGlow) pre-trained on a large amount of English and Hungarian speech data. The inputs of the convolutional neural network are ultrasound tongue images. The t…
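
As a rough illustration of the pipeline the abstract describes, below is a minimal PyTorch sketch: a small CNN maps each ultrasound tongue image to one mel-spectrogram frame, and a pre-trained WaveGlow renders the predicted mel sequence as a waveform. The layer sizes, the 80-band mel target, the 64x128 input resolution, and the PyTorch Hub checkpoint are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class UltrasoundToMel(nn.Module):
    """CNN sketch: one grayscale ultrasound tongue image in, one mel frame out."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),   # makes the head size input-independent
        )
        self.head = nn.Linear(64 * 4 * 4, n_mels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, H, W) ultrasound frames; returns (batch, n_mels)
        return self.head(self.features(x).flatten(1))

model = UltrasoundToMel()
frames = torch.randn(200, 1, 64, 128)        # dummy stand-in for real recordings
mel = model(frames).T.unsqueeze(0)           # (1, n_mels, time), WaveGlow's layout

# Vocoding step (commented out; assumes the public NVIDIA checkpoint, whereas
# the paper pre-trained its own WaveGlow on English and Hungarian speech):
# waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
#                           'nvidia_waveglow', pretrained=True)
# audio = waveglow.infer(mel)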

Cited by 24 publications (42 citation statements) · References 27 publications

Citation statements (ordered by relevance):
“…In [33], word recognition rates for a set of simple commands in automated listening tests were about 60%. Finally, the generative vocoders WaveNet [15] and WaveGlow [36] have been used, respectively, in [16] (2018) to produce single-word speech outputs from ECoG brain implant waveforms, and in [19] to synthesize a set of test sentences from ultrasound images of the tongue.…”
Section: Related Work
confidence: 99%
“…In these last examples [33], [16], [19], as well as in preliminary tests of our own using Griffin-Lim [37], the spectrograms predicted from sensor data, although globally correct, lack detailed harmonic structure, giving rise to speech that, while interesting, has only moderate intelligibility. Apparently, the new vocoders, however powerful, cannot compensate for shortcomings encountered in the acoustic parameter prediction phase.…”
Section: Related Work
confidence: 99%
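
For context, Griffin-Lim [37] reconstructs a waveform from a magnitude spectrogram alone by iteratively re-estimating the discarded phase. A minimal librosa sketch, in which a synthetic tone is a hypothetical stand-in for a model-predicted spectrogram:

import numpy as np
import librosa

# Stand-in for a DNN-predicted magnitude spectrogram: here the STFT magnitude
# of a synthetic 440 Hz tone; in the cited systems it comes from sensor data.
sr, n_fft, hop = 22050, 1024, 256
y = librosa.tone(440.0, sr=sr, duration=1.0)
predicted_mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

# Griffin-Lim recovers phase from magnitudes alone; it cannot restore the
# harmonic detail that the predicted magnitudes lack, which is the limitation
# the statement above describes.
audio = librosa.griffinlim(predicted_mag, n_iter=60,
                           hop_length=hop, n_fft=n_fft)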
“…For the articulatory-to-acoustic conversion task, typically electromagnetic articulography [25], ultrasound tongue imaging [6,7], permanent magnetic articulography [11], surface electromyography [13], magnetic resonance imaging [5] or video of the lip movements [16,10,3,19,17,23,22,24] are used. Lip-to-speech synthesis can be solved in two different ways: 1) the direct approach, in which speech is generated from the input signal without an intermediate step [16,10,3,19,17]; and 2) the indirect approach, in which lip-to-text recognition is followed by text-to-speech synthesis [23,22,24].…”
Section: Lip-to-speech Conversion
confidence: 99%
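
The two approaches named in this statement reduce to two pipeline shapes, sketched below with hypothetical function names (none of these identifiers come from the cited systems):

def lip_to_speech_direct(lip_frames, lip2speech_model):
    """Direct approach: waveform generated straight from the video frames."""
    return lip2speech_model(lip_frames)

def lip_to_speech_indirect(lip_frames, lip_reader, tts):
    """Indirect approach: lip-to-text recognition, then text-to-speech."""
    text = lip_reader(lip_frames)
    return tts(text)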
“…For AAM, one potential biosignal is ultrasound tongue imaging [6,7,8,9]. For the articulatory-to-acoustic conversion, typically, traditional [8] or neural vocoders [9] are used, which synthesize speech from the spectral parameters predicted by the DNNs from the articulatory input.…”
Section: Articulatory-to-acoustic Mapping
confidence: 99%
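
The two-stage structure this statement describes can be written as a small interface, sketched below with illustrative names; either a traditional vocoder or a neural one such as WaveGlow fills the second slot:

from typing import Protocol
import numpy as np

class Vocoder(Protocol):
    """Second stage: per-frame spectral parameters -> waveform samples."""
    def synthesize(self, spectral_params: np.ndarray) -> np.ndarray: ...

def articulatory_to_speech(articulatory_frames: np.ndarray,
                           dnn, vocoder: Vocoder) -> np.ndarray:
    # Stage 1: the DNN maps articulatory input to spectral parameters,
    # e.g. MGC + F0 for a traditional vocoder or a mel-spectrogram for WaveGlow.
    spectral_params = dnn.predict(articulatory_frames)
    # Stage 2: the vocoder renders the predicted parameters as audio.
    return vocoder.synthesize(spectral_params)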