2016
DOI: 10.1016/j.csl.2015.05.005
Integrating articulatory data in deep neural network-based acoustic modeling

Cited by 39 publications (31 citation statements)
References 28 publications
“…Again, we check whether this improvement could be due solely to the extra acoustic data, by training a similar model on only the acoustic input; the result (row 9) is worse, indicating that our improvements are not due to the extra acoustics alone. Row 10 corresponds to a single recognizer trained on the merged acoustic data of XRMB and TIMIT; this model does surprisingly well, but still… [footnote 3: In this case we use VCCAP with a 71-frame window acoustic input.] Table 2: PER (%) for XRMB→TIMIT.…”
Section: XRMB → TIMIT
confidence: 99%
“…The input to the DNNs consisted of 5 concatenated MFCC vectors (a context size used in all our previous work [3]) used to estimate a vector of 16 AFs. The 39 MFCCs were previously normalized to have 0 mean and 1 standard deviation.…”
Section: DNNs STL- and MTL-based Training
confidence: 99%
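The splicing and normalization described in the quote above can be sketched as follows. This is a minimal NumPy illustration, not the cited authors' code: the frame count and the padding strategy are assumptions; only the 39-dimensional MFCCs, the 5-frame context window, and the zero-mean/unit-variance normalization come from the quoted setup.

```python
import numpy as np

def normalize(feats):
    # Per-dimension zero-mean, unit-variance normalization of the 39 MFCCs.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def splice(feats, context=2):
    # Concatenate each frame with its +/- `context` neighbours (5 frames
    # total), padding the utterance edges by repeating the boundary frames.
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    n = feats.shape[0]
    return np.stack(
        [padded[i : i + 2 * context + 1].reshape(-1) for i in range(n)]
    )

mfcc = np.random.randn(100, 39)       # 100 frames of 39 MFCCs (dummy data)
dnn_input = splice(normalize(mfcc))   # shape (100, 195): 5 frames x 39 dims
```

Each spliced row would then feed the DNN that estimates the 16 articulatory features.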
“…Measured vocal tract movements, i.e., articulatory features (AFs), can be beneficial for several speech technology applications, including speech synthesis [1], automatic speech recognition (ASR) [2,3], pronunciation training [4] and speech-driven computer animation [5]. Techniques for measuring AFs range from electromagnetic articulography (EMA) to ultrasound and functional magnetic resonance imaging (fMRI).…”
Section: Introduction
confidence: 99%
“…In AAI, the objective is to estimate the vocal tract shape, which is estimated by the articulator positions based on the uttered speech. AAI can be useful in many speech-based applications, in particular, speech synthesis [1], automatic speech recognition (ASR) [2,3,4] and second language learning [5,6]. Over the years, researchers have addressed this problem employing various machine learning techniques including codebooks [7], Gaussian mixture models (GMM) [8], hidden Markov models (HMM) [9], mixture density networks [10], deep neural networks (DNNs) [11,12,13], and deep recurrent neural networks (RNNs) [14,15,16].…”
Section: Introduction
confidence: 99%