Opti-speech: a real-time, 3D visual feedback system for speech training

Katz, William F.; Campbell, Thomas F.; Wang, Jun; Farrar, Eric; Eubanks, James Coleman; Balasubramanian, Arvind; Prabhakaran, Balakrishnan; Rennaker, Rob

doi:10.21437/interspeech.2014-298

Cited by 18 publications

(7 citation statements)

References 17 publications


“…Articulatory movement prediction from text input can be useful for audiovisual speech synthesis. A specific application is computer-assisted pronunciation training / computer-aided language learning [26,27,28], which can be beneficial for learners of second languages. With such a combined TTS and text-to-articulatory prediction system, by giving an arbitrary input text, one is able to hear the speech and, in synchrony with it, see how to move the tongue in 2D or 3D to produce target speech sounds.…”

Section: Discussion (mentioning)

confidence: 99%

Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

Csapó¹

2021

Preprint


In this paper, we present our first experiments in text-to-articulation prediction, using ultrasound tongue image targets. We extend a traditional (vocoder-based) DNN-TTS framework with predicting PCA-compressed ultrasound images, of which the continuous tongue motion can be reconstructed in synchrony with synthesized speech. We use the data of eight speakers, train fully connected and recurrent neural networks, and show that FC-DNNs are more suitable for the prediction of sequential data than LSTMs, in case of limited training data. Objective experiments and visualized predictions show that the proposed solution is feasible and the generated ultrasound videos are close to natural tongue movement. Articulatory movement prediction from text input can be useful for audiovisual speech synthesis or computer-assisted pronunciation training.



“…As pointed out in Section 1, the results in AAI might be useful for speech recognition [2], synthesis [3], talking heads [4], and for pronunciation training and language tutoring [5].…”

Section: Discussion (mentioning)

confidence: 99%

“…Recently, there has been a significant interest in AAI, because learning the correlation between articulation and acoustics could improve the performance of several tasks such as speech recognition [2], synthesis [3] and talking heads [4]. It can help the visualization of speech production as 3D articulatory animations for pronunciation training and language tutoring [5].…”

Section: Introduction (mentioning)

confidence: 99%

Speaker Dependent Acoustic-to-Articulatory Inversion Using Real-Time MRI of the Vocal Tract

Csapó¹

2020

Interspeech 2020


Acoustic-to-articulatory inversion (AAI) methods estimate articulatory movements from the acoustic speech signal, which can be useful in several tasks such as speech recognition, synthesis, talking heads and language tutoring. Most earlier inversion studies are based on point-tracking articulatory techniques (e.g. EMA or XRMB). The advantage of rtMRI is that it provides dynamic information about the full midsagittal plane of the upper airway, with a high 'relative' spatial resolution. In this work, we estimated midsagittal rtMRI images of the vocal tract for speaker dependent AAI, using MGC-LSP spectral features as input. We applied FC-DNNs, CNNs and recurrent neural networks, and have shown that LSTMs are the most suitable for this task. As objective evaluation we measured normalized MSE, Structural Similarity Index (SSIM) and its complex wavelet version (CW-SSIM). The results indicate that the combination of FC-DNNs and LSTMs can achieve smooth generated MR images of the vocal tract, which are similar to the original MRI recordings (average CW-SSIM: 0.94).


“…Such geometrical models have been successfully used in previous work to generate animations from provided articulatory data: Katz et al. [25] presented a real-time visual feedback system that deforms a generic tongue model using EMA data. However, due to the generic nature of the model, their approach did not take anatomical differences into account.…”

Section: A Background (mentioning)

confidence: 99%

Synthesis of Tongue Motion and Acoustics From Text Using a Multimodal Articulatory Database

Steiner, Maguer, Hewer

2017

IEEE/ACM Trans. Audio Speech Lang. Process.


We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a 3D model of the tongue surface to an articulatory dataset and training a statistical parametric speech synthesis system directly on the tongue model parameters. We evaluate the model at every step by comparing the spatial coordinates of predicted articulatory movements against the reference data. The results indicate a global mean Euclidean distance of less than 2.8 mm, and our approach can be adapted to add an articulatory modality to conventional TTS applications without the need for extra data.
