Interspeech 2018
DOI: 10.21437/interspeech.2018-1078

Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces

Abstract: Silent Speech Interface systems apply two different strategies to solve the articulatory-to-acoustic conversion task. The recognition-and-synthesis approach applies speech recognition techniques to map the articulatory data to a textual transcript, which is then converted to speech by a conventional text-to-speech system. The direct synthesis approach seeks to convert the articulatory information directly to speech synthesis (vocoder) parameters. In both cases, deep neural networks are an evident and popular ch…
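To make the contrast between the two strategies concrete, the following is a minimal sketch of the direct synthesis approach only: a feed-forward network that maps an articulatory feature vector straight to a frame of vocoder parameters. The dimensions, layer sizes, and the use of PyTorch are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of direct synthesis: articulatory features -> vocoder parameters.
# All sizes below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

N_ART = 128    # assumed size of one articulatory (e.g. ultrasound) feature vector
N_VOC = 25     # assumed number of vocoder parameters per frame (e.g. MGC + F0)

direct_synth = nn.Sequential(
    nn.Linear(N_ART, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, N_VOC),          # linear output: regression of vocoder parameters
)

x = torch.randn(8, N_ART)           # a mini-batch of 8 articulatory frames
y_hat = direct_synth(x)             # predicted vocoder parameters, shape (8, N_VOC)
loss = nn.functional.mse_loss(y_hat, torch.randn(8, N_VOC))  # dummy targets
```

The recognition-and-synthesis alternative would instead classify each frame into phonetic units and hand the resulting transcript to a text-to-speech system.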

Cited by 28 publications (29 citation statements); references 28 publications.
“…This has the main idea of recording the soundless articulatory movement, and automatically generating speech from the movement information, while the subject is not producing any sound. For this automatic conversion task, typically electromagnetic articulography (EMA) [2,3,4,5], ultrasound tongue imaging (UTI) [6,7,8,9,10,11,12,13], permanent magnetic articulography (PMA) [14,15], surface electromyography (sEMG) [16,17,18], Non-Audible Murmur (NAM) [19] or video of the lip movements [7,20] are used.…”
Section: Introduction
Mentioning confidence: 99%
“…Deep autoencoders were used in [273], [274] to extract features from ultrasound images, achieving significant gains in both silent ASR and direct synthesis. In [275], multi-task learning of speech recognition and synthesis parameters was evaluated in the context of an ultrasound-based SSI system designed to enhance the performance of the individual tasks. The proposed method used a DNN-based mapping which was trained to simultaneously optimise two loss functions: an ASR loss, aiming at recognising phonetic units (corresponding to the states of an HMM-DNN recogniser) from the input articulatory features; and a speech synthesis loss, which predicted a set of acoustic parameters from the input features.…”
Section: Imaging Techniques
Mentioning confidence: 99%
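The quoted description maps naturally onto a network with shared hidden layers and two output heads, trained on a weighted sum of a cross-entropy ASR loss (over HMM-DNN state labels) and an MSE synthesis loss (over vocoder parameters). The sketch below is one plausible PyTorch rendering under those assumptions; the layer sizes, the number of HMM states, and the weighting factor alpha are illustrative, not taken from the paper.

```python
# Hedged sketch of the multi-task setup described above: shared layers feed an
# ASR head (HMM state classification) and a synthesis head (vocoder regression).
# Sizes and the loss weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ART, N_STATES, N_VOC = 128, 600, 25   # assumed dimensions

class MultiTaskSSI(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(N_ART, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.asr_head = nn.Linear(512, N_STATES)   # logits over HMM-DNN states
        self.synth_head = nn.Linear(512, N_VOC)    # vocoder parameter regression

    def forward(self, x):
        h = self.shared(x)
        return self.asr_head(h), self.synth_head(h)

model = MultiTaskSSI()
x = torch.randn(8, N_ART)                       # batch of articulatory frames
state_targets = torch.randint(0, N_STATES, (8,))
voc_targets = torch.randn(8, N_VOC)

asr_logits, voc_pred = model(x)
alpha = 0.5                                     # assumed task weighting
loss = alpha * F.cross_entropy(asr_logits, state_targets) \
     + (1 - alpha) * F.mse_loss(voc_pred, voc_targets)
loss.backward()                                 # both losses update the shared layers
```

Because both heads backpropagate into the same shared layers, each task acts as a regulariser for the other, which is the intended benefit of the multi-task formulation.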
“…If we follow the approach of our previous studies (see e.g. [7]-[9]), and also feed the feature vectors of the neighbouring images from the video into the network, we can apply a wider sliding window without increasing the overall size of the network. Fig.…”
Section: B. Spectral Parameter Estimation by Autoencoder Neural Network
Mentioning confidence: 99%
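The sliding-window idea in this quote amounts to concatenating each frame's feature vector with those of its neighbours before feeding the network, so the temporal context widens while the per-frame network stays the same size. A minimal numpy sketch; the window radius is an illustrative choice, not a value from the cited paper:

```python
# Sketch of frame stacking: concatenate each frame's features with its
# neighbours so the network sees a wider temporal window. The radius is
# an illustrative assumption.
import numpy as np

def stack_neighbours(feats: np.ndarray, radius: int = 2) -> np.ndarray:
    """feats: (n_frames, n_dims) -> (n_frames, (2*radius+1)*n_dims)."""
    padded = np.pad(feats, ((radius, radius), (0, 0)), mode="edge")
    n = feats.shape[0]
    return np.concatenate(
        [padded[i : i + n] for i in range(2 * radius + 1)], axis=1
    )

feats = np.random.randn(100, 128)         # 100 frames of 128-dim features
wide = stack_neighbours(feats, radius=2)  # shape (100, 640)
```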