Interspeech 2016
DOI: 10.21437/interspeech.2016-715
Articulatory-to-Acoustic Conversion with Cascaded Prediction of Spectral and Excitation Features Using Neural Networks

Abstract: This paper presents an articulatory-to-acoustic conversion method using electromagnetic midsagittal articulography (EMA) measurements as input features. Neural networks, including feed-forward deep neural networks (DNNs) and recurrent neural networks (RNNs) with long short-term memory (LSTM) cells, are adopted to map EMA features not only to spectral features (i.e., mel-cepstra) but also to excitation features (i.e., power, U/V flag and F0). Then speech waveforms are reconstructed using the predicted spec…
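The cascaded prediction described in the abstract — first mapping EMA features to spectra, then feeding the predicted spectra back in as auxiliary input when predicting excitation — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimensions, hidden size, and use of PyTorch are all assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 12 EMA channels, 25 mel-cepstral
# coefficients, 3 excitation features (power, U/V flag, F0).
# The paper's actual configuration may differ.
EMA_DIM, MCEP_DIM, EXC_DIM = 12, 25, 3

class CascadedConverter(nn.Module):
    """Two-stage articulatory-to-acoustic mapping:
    stage 1: EMA -> spectral features;
    stage 2: [EMA, predicted spectra] -> excitation features."""
    def __init__(self, hidden=64):
        super().__init__()
        self.spec_net = nn.LSTM(EMA_DIM, hidden, batch_first=True)
        self.spec_out = nn.Linear(hidden, MCEP_DIM)
        # The excitation stage receives the predicted spectra as
        # auxiliary input -- the cascade described in the abstract.
        self.exc_net = nn.LSTM(EMA_DIM + MCEP_DIM, hidden, batch_first=True)
        self.exc_out = nn.Linear(hidden, EXC_DIM)

    def forward(self, ema):                       # ema: (B, T, EMA_DIM)
        h, _ = self.spec_net(ema)
        mcep = self.spec_out(h)                   # (B, T, MCEP_DIM)
        h2, _ = self.exc_net(torch.cat([ema, mcep], dim=-1))
        exc = self.exc_out(h2)                    # (B, T, EXC_DIM)
        return mcep, exc

model = CascadedConverter()
mcep, exc = model(torch.randn(2, 100, EMA_DIM))   # 2 utterances, 100 frames
```

After training, the predicted mel-cepstra and excitation trajectories would be passed to a vocoder to reconstruct the waveform, as the abstract describes.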

Cited by 22 publications (17 citation statements)
References 24 publications
“…Articulatory-to-acoustic forward mapping: in AAF, the forward mapping function is learned from articulatory movements to estimate acoustic features in a subject-specific manner. Different methods have been proposed in the literature for AAF, such as Gaussian mixture models (GMMs) [31], hidden Markov models (HMMs) [32], deep neural networks (DNNs) [14] and recurrent neural networks [33]. In [18], a comparison was made across all these methods and it was shown that the BLSTM performs best among the existing statistical methods.…”
Section: Proposed Approach
confidence: 99%
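The BLSTM forward mapping favored in the comparison above can be sketched in a few lines; a minimal illustration, assuming PyTorch and illustrative feature dimensions (not the cited papers' configurations):

```python
import torch
import torch.nn as nn

# Bidirectional LSTM maps an EMA trajectory to acoustic features frame
# by frame. The output of the recurrent layer has twice the hidden size
# because the forward and backward directions are concatenated.
blstm = nn.LSTM(input_size=12, hidden_size=64, batch_first=True,
                bidirectional=True)
proj = nn.Linear(2 * 64, 25)     # project to 25 acoustic features/frame

ema = torch.randn(1, 200, 12)    # (batch, frames, EMA channels)
acoustic = proj(blstm(ema)[0])   # (1, 200, 25)
```

Because the backward pass conditions each frame on future articulatory context, a BLSTM is only applicable offline; a unidirectional LSTM would be needed for low-latency direct synthesis.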
“…the result of this step is text); this step is then followed by text-to-speech (TTS) synthesis [2,3,7,13,14,18]. In the SSR+TTS approach, any information related to speech prosody is entirely lost, while several studies have shown that certain prosodic components can be estimated reasonably well from the articulatory recordings (e.g., pitch [11,16,22,23]). Also, the smaller delay of the direct synthesis approach might enable conversational use.…”
Section: Introduction
confidence: 99%
“…Liu et al. compared DNN, RNN and LSTM neural networks for the prediction of the V/UV flag and voicing. They found that the strategy of cascaded prediction, that is, using the predicted spectral features as auxiliary input, increases the accuracy of excitation feature prediction [22]. Zhao et al. found that the velocity and acceleration of EMA movements are effective in articulatory-to-F0 prediction, and that LSTMs perform better than DNNs in this task.…”
Section: Introduction
confidence: 99%
“…Gonzalez and his colleagues compared GMM, DNN and RNN [16] for PMA-based direct synthesis, while we used DNNs to predict the spectral parameters [7] and F0 [8] of a vocoder using UTI as articulatory input. Liu et al. compared DNN, RNN and LSTM neural networks for the prediction of the V/UV flag and voicing [27], while Zhao et al. found that LSTMs perform better than DNNs for articulatory-to-F0 prediction [28].…”
Section: Introduction
confidence: 99%