Articulatory-to-Acoustic Conversion with Cascaded Prediction of Spectral and Excitation Features Using Neural Networks

Liu, Zheng-Chen; Ling, Zhen-Hua; Dai, Li-Rong

doi:10.21437/interspeech.2016-715

Cited by 22 publications

(17 citation statements)

References 24 publications


“…Articulatory-to-acoustic forward mapping: In AAF, forward mapping function is learned from articulatory movements to estimate acoustic features in a subject specific manner. Different methods have been proposed in literature for AAF, such as Gaussian mixture models (GMM) [31], hidden Markov models (HMM) [32], and deep neural networks (DNN) [14] and recurrent neural networks [33]. In [18], a comparison has been made across all the methods and it was shown that BLSTM performs better among the existing statistical methods.…”

Section: Proposed Approach (mentioning)

confidence: 99%

An Investigation on Speaker Specific Articulatory Synthesis with Speaker Independent Articulatory Inversion

Illa, Ghosh, 2019, Interspeech 2019


Estimating speech representations from articulatory movements is known as articulatory-to-acoustic forward (AAF) mapping. Typically this mapping is learned using directly measured articulatory movements in a subject-specific manner. Such AAF mapping has been shown to benefit speech synthesis applications. In this work, we investigate the speaker similarity and naturalness of utterances generated by AAF when it is driven by the articulatory movements of a subject (referred to as the cross speaker) different from the speaker (the target speaker) used for training the AAF mapping. Experiments are performed with directly measured articulatory data from 9 speakers (8 target speakers and 1 cross speaker), recorded using the electromagnetic articulograph AG501. Experiments are also performed with articulatory features estimated using a speaker-independent acoustic-to-articulatory inversion (SI-AAI) model trained on 26 reference speakers. Objective evaluation on the target speakers reveals that the articulatory features estimated from SI-AAI result in a lower Mel-cepstrum distortion than directly measured articulatory features. Further, listening tests reveal that the directly measured articulatory movements preserve speaker similarity better than the estimated ones. For naturalness, however, articulatory movements predicted by SI-AAI perform better than the direct measurements.
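The Mel-cepstrum distortion used as the objective measure in this abstract is commonly computed per frame from the difference between reference and predicted Mel-cepstral coefficients and then averaged. The sketch below follows that common definition; it is not necessarily the exact variant used by the authors. The function name, the `skip_c0` option, and the (frames x coefficients) array layout are assumptions made for illustration.

```python
import numpy as np

def mel_cepstral_distortion(ref, est, skip_c0=True):
    """Mean Mel-cepstral distortion in dB between two (frames x coeffs) arrays.

    Assumes the two sequences are already time-aligned and equally shaped.
    """
    ref = np.asarray(ref, dtype=float)
    est = np.asarray(est, dtype=float)
    if ref.shape != est.shape:
        raise ValueError("ref and est must have the same shape")
    if skip_c0:
        # The 0th coefficient mostly carries energy and is often excluded.
        ref, est = ref[:, 1:], est[:, 1:]
    diff = ref - est
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

# Toy data: two frames, three coefficients, one unit error in c1 per frame.
reference = np.zeros((2, 3))
predicted = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
mcd_value = mel_cepstral_distortion(reference, predicted)
```

A lower value means the predicted spectral envelope is closer to the reference, which is the sense in which the abstract compares estimated and measured articulatory features.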


“…the result of this step is text); this step is then followed by text-to-speech (TTS) synthesis [2,3,7,13,14,18]. In the SSR+TTS approach, any information related to speech prosody is totally lost, while several studies have shown that certain prosodic components may be estimated reasonably well from the articulatory recordings (e.g., pitch [11,16,22,23]). Also, the smaller delay by the direct synthesis approach might enable conversational use.…”

Section: Introduction (mentioning)

confidence: 99%

“…Liu et al compared DNN, RNN and LSTM neural networks for the prediction of the V/UV flag and voicing. They found that the strategy of cascaded prediction, that is, using the predicted spectral features as auxiliary input increases the accuracy of excitation feature prediction [22]. Zhao et al found that the velocity and acceleration of EMA movements are effective in articulatory-to-F0 prediction, and that LSTMs perform better than DNNs in this task.…”

Section: Introduction (mentioning)

confidence: 99%
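The cascaded prediction strategy described in the citation statement above (predicted spectral features fed as auxiliary input to the excitation predictor) can be illustrated with a small data-flow sketch. It uses two untrained feed-forward networks with random weights and is not the cited paper's implementation. All dimensions, layer sizes, and names (`init_mlp`, `cascaded_predict`, the two-value excitation output) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden_dim, out_dim):
    """Random weights for a one-hidden-layer network (untrained, for illustration)."""
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def mlp_forward(params, x):
    hidden = np.tanh(x @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]

# Hypothetical sizes: articulatory input, spectral output, excitation output.
ART_DIM, SPEC_DIM, EXC_DIM = 12, 40, 2

spectral_net = init_mlp(ART_DIM, 64, SPEC_DIM)
# The excitation network sees the articulatory input plus the predicted spectrum.
excitation_net = init_mlp(ART_DIM + SPEC_DIM, 64, EXC_DIM)

def cascaded_predict(articulatory):
    """Stage 1 predicts spectral features; stage 2 reuses them as auxiliary input."""
    spectral = mlp_forward(spectral_net, articulatory)
    cascade_input = np.concatenate([articulatory, spectral], axis=1)
    excitation = mlp_forward(excitation_net, cascade_input)
    return spectral, excitation

frames = rng.normal(size=(5, ART_DIM))  # five frames of toy articulatory data
spectral, excitation = cascaded_predict(frames)
```

The point of the cascade is only the concatenation step: the second network receives the first network's output alongside the original articulatory features, rather than predicting excitation from articulation alone.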

Ultrasound-Based Silent Speech Interface Built on a Continuous Vocoder

et al. 2019


Recently it was shown that within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher quality speech when using a continuous pitch estimate, which takes non-zero pitch values even when voicing is not present. Therefore, in this paper on UTI-based SSI, we use a simple continuous F0 tracker which does not apply a strict voiced / unvoiced decision. Continuous vocoder parameters (ContF0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a convolutional neural network, with UTI as input. The results demonstrate that during the articulatory-to-acoustic mapping experiments, the continuous F0 is predicted with lower error, and the continuous vocoder produces slightly more natural synthesized speech than the baseline vocoder using standard discontinuous F0.


“…Gonzalez and his colleagues compared GMM, DNN and RNN [16] for PMA-based direct synthesis, while we used DNNs to predict the spectral parameters [7] and F0 [8] of a vocoder using UTI as articulatory input. Liu et al compared DNN, RNN and LSTM neural networks for the prediction of the V/U flag and voicing [27], while Zhao et al found that LSTMs perform better than DNNs for articulatory-to-F0 prediction [28].…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces

et al. 2018


Silent Speech Interface systems apply two different strategies to solve the articulatory-to-acoustic conversion task. The recognition-and-synthesis approach applies speech recognition techniques to map the articulatory data to a textual transcript, which is then converted to speech by a conventional text-to-speech system. The direct synthesis approach seeks to convert the articulatory information directly to speech synthesis (vocoder) parameters. In both cases, deep neural networks are an evident and popular choice to learn the mapping task. Recognizing that the learning of speech recognition and speech synthesis targets (acoustic model states vs. vocoder parameters) are two closely related tasks over the same ultrasound tongue image input, here we experiment with the multi-task training of deep neural networks, which seeks to solve the two tasks simultaneously. Our results show that the parallel learning of the two types of targets is indeed beneficial for both tasks. Moreover, we obtained further improvements by using multi-task training as a weight initialization step before task-specific training. Overall, we report a relative error rate reduction of about 7% in both the speech recognition and the speech synthesis tasks.
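The multi-task setup described in this abstract, one network over the same input with a recognition target and a vocoder-parameter target, can be sketched as a shared hidden layer with two output heads and a weighted sum of the two losses. This is a generic illustration, not the authors' architecture. The layer sizes, the weighting parameter `alpha`, and all names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: input features, shared hidden units,
# acoustic-model states, vocoder parameters.
IN_DIM, HID_DIM, N_STATES, VOC_DIM = 128, 32, 10, 25

W_shared = rng.normal(0.0, 0.1, (IN_DIM, HID_DIM))  # used by both tasks
W_asr = rng.normal(0.0, 0.1, (HID_DIM, N_STATES))   # recognition head
W_voc = rng.normal(0.0, 0.1, (HID_DIM, VOC_DIM))    # synthesis head

def multitask_loss(x, state_ids, vocoder_targets, alpha=0.5):
    """Weighted sum of cross-entropy (states) and MSE (vocoder parameters)."""
    hidden = np.tanh(x @ W_shared)
    # Classification head: numerically stable log-softmax over states.
    logits = hidden @ W_asr
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    cross_entropy = -log_probs[np.arange(len(state_ids)), state_ids].mean()
    # Regression head: mean squared error on vocoder parameters.
    mse = np.mean((hidden @ W_voc - vocoder_targets) ** 2)
    return alpha * cross_entropy + (1.0 - alpha) * mse, cross_entropy, mse

x = rng.normal(size=(4, IN_DIM))                 # four toy input frames
state_ids = rng.integers(0, N_STATES, size=4)    # toy state labels
targets = rng.normal(size=(4, VOC_DIM))          # toy vocoder targets
total, ce, mse = multitask_loss(x, state_ids, targets)
```

Because both heads read the same hidden representation, gradients from either loss would update `W_shared`, which is the mechanism by which one task can help the other.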

