Interspeech 2019
DOI: 10.21437/interspeech.2019-2526

A Phonetic-Level Analysis of Different Input Features for Articulatory Inversion

Abstract: The challenge of articulatory inversion is to determine the temporal movement of the articulators from the speech waveform, or from acoustic-phonetic knowledge, e.g. derived from information about the linguistic content of the utterance. The actual position of the articulators is typically obtained from measured data, in our case position measurements obtained using EMA (electromagnetic articulography). In this paper, we investigate the impact on the articulatory inversion problem of using features derived from th…

Cited by 6 publications (7 citation statements) · References 26 publications
“…Different acoustic representations, such as Line Spectral Frequencies (LSFs) [19], Perceptual Linear Predictive coding (PLP) [20] and Mel-Frequency Cepstral Coefficients (MFCCs) [21] have been widely used for the AAI task. Filter-Bank Energies (FBEs) from STRAIGHT spectra [22] have also been employed as the input of the AAI system [23]. Among these features, MFCCs are reported to perform better compared to other features for SI-AAI [24], [25].…”
Section: Introduction
confidence: 99%
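The excerpt above compares acoustic front-ends (LSFs, PLP, FBEs, MFCCs) as inputs to acoustic-to-articulatory inversion (AAI). As a minimal illustration of the MFCC pipeline the excerpt refers to, the sketch below computes MFCC-like features with numpy only: framing, power spectrum, log mel filter-bank energies, then a DCT-II. All parameter values (16 kHz sampling, 512-point FFT, 26 filters, 13 cepstra) are common defaults, not values taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centres equally spaced on the mel scale.
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mfcc_like(signal, sr=16000, n_fft=512, hop=160, n_filters=26, n_ceps=13):
    # Frame -> Hamming window -> power spectrum -> log mel energies -> DCT-II.
    window = np.hamming(n_fft)
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.stack([signal[s:s + n_fft] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    fbe = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), m + 0.5) / n_filters)
    return fbe @ dct.T  # shape: (num_frames, n_ceps)
```

Dropping the final DCT step yields the log filter-bank energies (FBE-style features) also mentioned in the excerpt; the DCT mainly decorrelates the filter-bank channels.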
“…In the literature, various techniques are applied to the AAI problem, e.g. search-based algorithms in the joint codebook of the acoustic-articulatory space [26], [27], non-parametric and parametric statistical methods, such as support vector regression (SVR) [28], local regression approach based on K-nearest neighbour [29], joint acoustic-articulatory distribution by utilizing Gaussian mixture models (GMMs) [30], hidden Markov models (HMMs) [7], mixture density networks (MDNs) [31], deep neural networks (DNNs) [4], [32], and recurrent neural networks (RNNs) [23], [33]- [39]. Among those methods, the neural network based models outperform the rest by having the ability of dealing well with large context size and better modelling of acoustic and articulatory spaces.…”
Section: Introduction
confidence: 99%
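The excerpt above surveys AAI methods and notes that neural-network regressors benefit from a large acoustic context. A minimal sketch of that idea, under assumed dimensions (13 acoustic coefficients per frame, 12 EMA coordinates, ±5 frames of context — all illustrative, not from the cited works): stack neighbouring frames into one input vector and map it through a small feed-forward network. `TinyAAINet` and `stack_context` are hypothetical names; a real system would be trained, e.g. with gradient descent, rather than use random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def stack_context(feats, ctx=5):
    """Stack +/- ctx neighbouring frames so the network sees local context."""
    T, _ = feats.shape
    padded = np.pad(feats, ((ctx, ctx), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * ctx + 1].ravel() for t in range(T)])

class TinyAAINet:
    """Two-layer MLP: stacked acoustic frames -> EMA articulator positions."""
    def __init__(self, in_dim, hidden=64, out_dim=12):
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, out_dim))
        self.b2 = np.zeros(out_dim)

    def forward(self, x):
        h = np.tanh(x @ self.W1 + self.b1)   # hidden representation
        return h @ self.W2 + self.b2         # per-frame EMA estimate
```

The RNN-based systems cited in the excerpt replace the explicit frame stacking with recurrent state, letting the context size be learned rather than fixed.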
“…There exist several techniques to address the AAI problem, for example, search-based algorithms in the joint codebook of the acoustic-articulatory space [13], [14], non-parametric and parametric statistical methods such as support vector regression (SVR) [15], joint acoustic-articulatory distribution by utilizing Gaussian mixture models (GMMs) [16], hidden Markov models (HMMs) [17], mixture density networks (MDNs) [18], deep neural networks (DNNs) [19], and recurrent neural networks (RNNs) [20], [21]. However, the great majority of those works deals with clean conditions only.…”
Section: Introduction
confidence: 99%
“…The first problem is the one-to-many mapping problem because several articulator gestures may produce the same acoustic speech signal. A common approach to address this problem is to employ trajectory based deep neural networks [12,13,14,15]. The next problem is insufficient amounts of data for adequate modeling of the acoustic space, leading to inferior performance for speaker independent (SI) scenarios compared to the speaker dependent (SD) scenarios, or matched speakers compared to mismatched speakers in SI scenarios.…”
Section: Introduction
confidence: 99%
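The excerpt above describes the one-to-many problem: frame-wise regression can jump between the competing articulator configurations that produce the same acoustics, whereas trajectory-based networks produce physically plausible motion. As a crude stand-in for trajectory modelling (not the cited papers' method), the sketch below post-smooths per-frame EMA predictions with a moving average; the window length is an illustrative choice.

```python
import numpy as np

def smooth_trajectory(pred, win=9):
    """Moving-average smoothing of per-frame EMA predictions.

    pred: array of shape (T, D), one row per frame, one column per
    articulator coordinate. Frame-wise regressors can oscillate between
    equally valid articulatory configurations; smoothing each coordinate
    over time suppresses those jumps, mimicking (very roughly) what a
    trajectory model achieves in a principled way.
    """
    kernel = np.ones(win) / win
    return np.stack([np.convolve(pred[:, d], kernel, mode="same")
                     for d in range(pred.shape[1])], axis=1)
```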
“…the predicted phone sequence for that utterance, can be used. Indeed, to cope with scarcity of input data for modeling the acoustic space in the AAI task, augmenting the acoustic features with linguistic information has been shown to improve the performance [16,13,15] for SD scenarios. Systems utilizing the linguistic information alone have also been reported to work quite well [17,15] even when using binary features, e.g.…”
Section: Introduction
confidence: 99%
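The excerpt above notes that augmenting acoustic features with linguistic information, even binary phone-identity features, improves AAI. A minimal sketch of that augmentation, with a toy four-phone inventory (real systems use the full phone set of the language; the names `phone_one_hot` and `augment` are hypothetical):

```python
import numpy as np

PHONES = ["sil", "aa", "t", "s"]  # toy inventory for illustration only

def phone_one_hot(frame_phones):
    """Binary phone-identity feature per frame: 1 in the phone's slot."""
    idx = {p: i for i, p in enumerate(PHONES)}
    out = np.zeros((len(frame_phones), len(PHONES)))
    for t, p in enumerate(frame_phones):
        out[t, idx[p]] = 1.0
    return out

def augment(acoustic, frame_phones):
    """Concatenate acoustic frames with frame-aligned linguistic features."""
    return np.hstack([acoustic, phone_one_hot(frame_phones)])
```

In practice the frame-level phone labels come from a forced alignment or, as the excerpt mentions, from a phone recogniser's predicted sequence.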