We present a multilinear statistical model of the human tongue that captures anatomical and tongue-pose-related shape variations separately. The model is derived from 3D magnetic resonance imaging data of 11 speakers sustaining speech-related vocal tract configurations. To extract model parameters, we use a minimally supervised method based on an image segmentation approach and a template fitting technique. Furthermore, we use image denoising to deal with potentially corrupted data, reconstruction of palate surface information to handle palatal tongue contacts, and a bootstrap strategy to refine the obtained shapes. Our evaluation shows that, by limiting the degrees of freedom for the anatomical and speech-related variations to 5 and 4, respectively, we obtain a model that can reliably register unknown data while avoiding overfitting. Finally, we show that the model can be used to generate plausible tongue animation by tracking sparse motion capture data.
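To make the separation of anatomy and pose concrete, the following sketch shows how such a multilinear (Tucker-style) model could be evaluated, with one weight vector per variation mode. All names and dimensions are illustrative assumptions; the abstract only fixes the 5 anatomical and 4 speech-related degrees of freedom, and the mean shape is omitted for brevity.

```python
import numpy as np

# Hypothetical dimensions: N mesh vertices, 5 anatomy modes, 4 pose modes.
num_vertices = 3000
core = np.random.rand(3 * num_vertices, 5, 4)  # stands in for the learned core tensor

def reconstruct_tongue(core, anatomy_weights, pose_weights):
    """Evaluate the multilinear model: contract the core tensor with
    per-speaker (anatomy) and per-pose weight vectors."""
    # The mode-2 and mode-3 products collapse the two weight dimensions.
    shape = np.einsum('vap,a,p->v', core, anatomy_weights, pose_weights)
    return shape.reshape(-1, 3)  # (N, 3) vertex positions

anatomy = np.random.rand(5)  # 5 anatomical degrees of freedom
pose = np.random.rand(4)     # 4 speech-related degrees of freedom
vertices = reconstruct_tongue(core, anatomy, pose)
print(vertices.shape)  # (3000, 3)
```

Because the two weight vectors enter the contraction independently, fixing one and varying the other changes only anatomy or only pose, which is exactly the separation the model is designed to provide.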
We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a 3D model of the tongue surface to an articulatory dataset and training a statistical parametric speech synthesis system directly on the tongue model parameters. We evaluate the model at every step by comparing the spatial coordinates of predicted articulatory movements against the reference data. The results indicate a global mean Euclidean distance of less than 2.8 mm. Moreover, our approach can be extended to add an articulatory modality to conventional TTS applications without the need for extra data.
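The evaluation criterion above reduces to a point-wise comparison of predicted and reference positions. Below is a minimal sketch of the global mean Euclidean distance, assuming the trajectories are stored as (frames, points, 3) arrays in millimetres; the data layout is our assumption, not specified in the abstract.

```python
import numpy as np

def mean_euclidean_distance(predicted, reference):
    """Global mean Euclidean distance between predicted and reference
    articulator positions, both of shape (frames, points, 3), in mm."""
    return np.mean(np.linalg.norm(predicted - reference, axis=-1))

# Toy stand-in data: 100 frames, 4 tracked tongue points.
pred = np.random.rand(100, 4, 3) * 10.0
ref = pred + np.random.normal(scale=1.0, size=pred.shape)
print(f"{mean_euclidean_distance(pred, ref):.2f} mm")
```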
The reliable estimation of the Lagrangian strain tensor from an image sequence is a challenging problem in mechanical engineering. Since this tensor involves first-order motion derivatives, it appears tempting to estimate the optical flow field with a highly accurate variational model and compute its derivatives afterwards. In this paper, we explain why this idea is inappropriate due to lower-order smoothness assumptions and the ill-posedness of differentiation. As a remedy, we propose a variational framework that performs higher-order regularisation of the optical flow field and computes the Lagrangian strain tensor directly from the image measurements. Due to its recursive structure, this framework is very generic: it can incorporate smoothness assumptions of arbitrarily high order and allows us to compute derivatives of any desired order in a stable way. Using a biaxial tensile experiment with an elastomer, we demonstrate that our novel approach gives substantially better results for the Lagrangian strain tensor than differentiating the optical flow field. Moreover, it also outperforms a frequently used commercial software package that represents the state of the art for Lagrangian strain tensor computation.
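For reference, the first display below gives the standard definition of the (Green–)Lagrangian strain tensor in terms of the displacement field u, which is why first-order motion derivatives enter the problem. The second display is one plausible second-order regularised energy of the kind such a framework generalises, with f₁, f₂ denoting consecutive frames and α > 0 a regularisation weight; the exact functional is not given in the abstract, so this form is our assumption.

```latex
% Green–Lagrange strain tensor via the deformation gradient (standard definition):
\[
  F = I + \nabla u, \qquad
  E = \tfrac{1}{2}\bigl(F^{\top}F - I\bigr)
    = \tfrac{1}{2}\bigl(\nabla u + \nabla u^{\top} + \nabla u^{\top}\nabla u\bigr)
\]
% Illustrative second-order regularised optical flow energy (assumed form):
\[
  \mathcal{E}(u) \;=\; \int_{\Omega}
      \bigl( f_{2}(x + u(x)) - f_{1}(x) \bigr)^{2}
      \;+\; \alpha \,\lVert \nabla^{2} u \rVert_{F}^{2} \, dx
\]
```

Penalising ∇²u rather than ∇u leaves the first-order derivatives needed for E unconstrained by the smoothness term, which is the intuition behind regularising at a higher order than the quantity one wants to extract.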
We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on two standard methodologies, based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), respectively, which we use to model both the acoustic features and the tongue model parameter weights. We evaluate both methodologies at every step by comparing the predicted articulatory movements against the reference data. The results show that, even with less than two hours of data, DNNs already outperform HMMs.
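As a concrete illustration of the DNN branch, the sketch below trains a small feed-forward regressor from frame-level linguistic context features to tongue model parameter weights. It is a minimal stand-in, not the paper's architecture: the feature and output dimensions, layer sizes, and the choice of PyTorch are all our assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 300 context features in, 8 tongue parameter
# weights out (e.g. pose weights plus deltas); all sizes are illustrative.
class TongueParameterDNN(nn.Module):
    def __init__(self, in_dim=300, hidden=256, out_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, out_dim),  # regression: no output activation
        )

    def forward(self, x):
        return self.net(x)

model = TongueParameterDNN()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One toy training step on random stand-in data.
features = torch.randn(64, 300)  # frame-level context features
targets = torch.randn(64, 8)     # reference tongue model weights
loss = loss_fn(model(features), targets)
optimiser.zero_grad()
loss.backward()
optimiser.step()
print(loss.item())
```

An HMM-based system would instead model these targets with decision-tree-clustered Gaussian state distributions; the DNN replaces that mapping with a single regression network, which is where its advantage on small datasets is measured.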