Vocal Tract Length Normalization using a Gaussian mixture model framework for query-by-example spoken term detection

Madhavi, Maulik C.; Patil, Hemant A.

doi:10.1016/j.csl.2019.03.005

Cited by 7 publications

(1 citation statement)

References 61 publications

“…A classic approach to tackle the problem of speaker variability in Automatic Speech Recognition is Vocal Tract Length Normalization (VTLN) [9][10][11], which has also been applied in low-resource settings, e.g., for spoken term detection [12]. Furthermore, Feature Space Maximum Likelihood Linear Regression (fMLLR) is often used, even in low-resource setups [13].…”

Section: Related Work (mentioning)

confidence: 99%

Voice Conversion Based Speaker Normalization for Acoustic Unit Discovery

Glarner, Ebbers, Häb-Umbach

2021

Preprint

Discovering speaker-independent acoustic units purely from spoken input is known to be a hard problem. In this work we propose an unsupervised speaker normalization technique that is applied prior to unit discovery. It is based on separating speaker-related from content-induced variations in a speech signal with an adversarial contrastive predictive coding approach. This technique requires neither transcribed speech nor speaker labels and, furthermore, can be trained in a multilingual fashion, thus achieving speaker normalization even if only a small amount of unlabeled data is available from the target language. The speaker normalization is done by mapping all utterances to a medoid style that is representative of the whole database. We demonstrate the effectiveness of the approach by conducting acoustic unit discovery with a hidden Markov model variational autoencoder, noting, however, that the proposed speaker normalization can serve as a front end to any unit discovery system. Experiments on English, Yoruba and Mboshi show improvements compared to using non-normalized input.
