Prediction of Voicing and the F0 Contour from Electromagnetic Articulography Data for Articulation-to-Speech Synthesis

Stone, Simon; Schmidt, Philipp; Birkholz, Peter

doi:10.1109/icassp40776.2020.9053231

Cited by 2 publications

(2 citation statements)

References 12 publications

“…While deep EMA-to-speech models have been previously studied, as far as we are aware [35,36,37], current models are not highly intelligible, achieving a transcription WER of around 30% on open-vocabulary tasks [35]. In this work, we build an EMA-to-speech model that achieves a transcription WER of 18.5% and perform detailed error analyses on the synthesized utterances.…”

Section: Articulatory Synthesis (mentioning)

confidence: 99%

Deep Speech Synthesis from Articulatory Representations

Wu, Watanabe, Goldstein et al. 2022

Interspeech 2022

In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract. This task provides a promising direction for speech synthesis research, as the articulatory space is compact, smooth, and interpretable. Current works have highlighted the potential for deep learning models to perform articulatory synthesis. However, it remains unclear whether these models can achieve the efficiency and fidelity of the human speech production system. To help bridge this gap, we propose a time-domain articulatory synthesis methodology and demonstrate its efficacy with both electromagnetic articulography (EMA) and synthetic articulatory feature inputs. Our model is computationally efficient and achieves a transcription word error rate (WER) of 18.5% for the EMA-to-speech task, yielding an improvement of 11.6% compared to prior work. Through interpolation experiments, we also highlight the generalizability and interpretability of our approach.
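
For readers unfamiliar with the metric quoted in the abstract above, the following is a minimal sketch of how a transcription word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a transcript of the synthesized audio, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; the cited paper does not describe its scoring code here, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; tokenization is a plain
    whitespace split with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.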

“…The results showed that an affine transformation can satisfactorily approximate the relation between the two speaking modes. More recently, in [245], pitch prediction (i.e., prediction of the speech voicing and fundamental frequency) from EMA data captured by six coils placed on the upper lip, the lower lip, the lower incisor, the tongue tip, the tongue body, and the tongue dorsum was investigated, achieving surprisingly good results despite EMA not capturing any information about the vibrations of the vocal folds.…”

Section: Magnetic Articulography (mentioning)

confidence: 99%

Silent Speech Interfaces for Speech Restoration: A Review

et al. 2020

This review summarises the status of silent speech interface (SSI) research. SSIs rely on non-acoustic biosignals generated by the human body during speech production to enable communication whenever normal verbal communication is not possible or not desirable. In this review, we focus on the first case and present latest SSI research aimed at providing new alternative and augmentative communication methods for persons with severe speech disorders. SSIs can employ a variety of biosignals to enable silent communication, such as electrophysiological recordings of neural activity, electromyographic (EMG) recordings of vocal tract movements or the direct tracking of articulator movements using imaging techniques. Depending on the disorder, some sensing techniques may be better suited than others to capture speech-related information. For instance, EMG and imaging techniques are well suited for laryngectomised patients, whose vocal tract remains almost intact but are unable to speak after the removal of the vocal folds, but fail for severely paralysed individuals. From the biosignals, SSIs decode the intended message, using automatic speech recognition or speech synthesis algorithms. Despite considerable advances in recent years, most present-day SSIs have only been validated in laboratory settings for healthy users. Thus, as discussed in this paper, a number of challenges remain to be addressed in future research before SSIs can be promoted to real-world applications. If these issues can be addressed successfully, future SSIs will improve the lives of persons with severe speech impairments by restoring their communication capabilities.
