On-line speaking rate estimation using Gaussian mixture models

Faltlhauser, Robert; Pfau, Thilo; Ruske, Günther

doi:10.1109/icassp.2000.861830

Cited by 21 publications

(29 citation statements)

References 5 publications


“…An Euclidean distance is used to estimate this dependency and to discriminate between slow and fast speech. In Falthauser et al (2000), speaking rate dependent GMMs are used to classify speech spurts into slow, medium and fast speech. The output likelihoods of these GMMs are used as input to a neural network whose targets are the actual phonemes.…”

Section: Rate Of Speech (mentioning)

confidence: 99%

Automatic speech recognition and speech variability: A review

BenZeghiba, Mori, Deroo, et al. 2007. Speech Communication


Major progress is being recorded regularly on both the technology and exploitation of automatic speech recognition (ASR) and spoken language systems. However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances. This is related to several factors, such as the sensitivity to the environment (background noise), or the weak representation of grammatical and semantic knowledge. Current research is also emphasizing deficiencies in dealing with variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes the use by specific populations. Also, some applications, like directory assistance, particularly stress the core recognition technology due to the very high active vocabulary (application perplexity). There are actually many factors affecting the speech realization: regional, sociolinguistic, or related to the environment or the speaker herself. These create a wide range of variations that may not be modeled correctly (speaker, gender, speaking rate, vocal effort, regional accent, speaking style, non-stationarity, etc.), especially when resources for system training are scarce. This paper outlines current advances related to these topics.


“…However, in [13] we used a combination of multiple acoustic features which was not applicable in real time. In [14], Faltlhauser et al proposed an online speaking rate estimation model based on neural networks. They used GMMs to first separate data into three rate groups (fast, moderate, slow) and built a neural network with the input of the likelihood values generated by GMMs.…”

Section: Introduction (mentioning)

confidence: 99%

“…The use of RNNs for speaking rate has not been explored in the literature to the best of our knowledge. Although neural networks (NN) have been used to estimate speaking rate in [14], this model does not exploit the longer-term dependencies that RNNs exploit. Moreover, our algorithm requires training a single RNN, whereas the work in [14] uses a sequential procedure that requires training independent models for slow, moderate, and fast speech.…”

Section: Introduction (mentioning)

confidence: 99%

Online speaking rate estimation using recurrent neural networks

Jiao, Tao, Berisha, et al. 2016. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


A reliable online speaking rate estimation tool is useful in many domains, including speech recognition, speech therapy intervention, speaker identification, etc. This paper proposes an online speaking rate estimation model based on recurrent neural networks (RNNs). Speaking rate is a long-term feature of speech, which depends on how many syllables were spoken over an extended time window (seconds). We posit that since RNNs can capture long-term dependencies through the memory of previous hidden states, they are a good match for the speaking rate estimation task. Here we train a long short-term memory (LSTM) RNN on a set of speech features that are known to correlate with speech rhythm. An evaluation on spontaneous speech shows that the method yields a higher correlation between the estimated rate and the ground-truth rate when compared to the state-of-the-art alternatives. The evaluation on longitudinal pathological speech shows that the proposed method can capture long-term and short-term changes in speaking rate.


“…In that direction Jiao et al (2015) proposed a convex optimization based speech rate estimation to avoid dependency on heuristic peak detection strategy. Faltlhauser et al (2000) used the Gaussian mixture model (GMM) for classification of speaking rate into three categories - slow, medium and fast. Following this, they used the class probabilities to estimate speaking rate with the help of Neural Networks.…”

Section: Introduction (mentioning)

confidence: 99%
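
The first stage described in the statements above (one GMM per rate class, with the resulting class scores handed to a later model) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the cited authors' implementation: the one-dimensional feature, the mixture parameters in `RATE_GMMS`, the equal class priors, and the function names are all hypothetical, and the neural network that would consume the posteriors is not shown.

```python
import math

# Hypothetical 1-D GMMs, one per rate class, as (weight, mean, variance) tuples.
# All numbers are invented for illustration; real systems would train
# multi-dimensional models on acoustic features.
RATE_GMMS = {
    "slow":   [(0.6, 2.5, 0.25), (0.4, 3.0, 0.36)],
    "medium": [(0.5, 4.0, 0.25), (0.5, 4.6, 0.36)],
    "fast":   [(0.7, 6.0, 0.49), (0.3, 7.0, 0.64)],
}


def gmm_log_likelihood(frames, components):
    """Return sum over frames of log(sum_k w_k * N(x | mean_k, var_k))."""
    total = 0.0
    for x in frames:
        # Log of each weighted Gaussian component density at x.
        terms = [
            math.log(w) - 0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
            for w, mean, var in components
        ]
        # Log-sum-exp keeps the mixture sum numerically stable.
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def class_posteriors(frames, gmms=RATE_GMMS):
    """Normalize per-class likelihoods into posteriors (equal priors assumed)."""
    scores = {label: gmm_log_likelihood(frames, comps) for label, comps in gmms.items()}
    peak = max(scores.values())
    unnormalized = {label: math.exp(s - peak) for label, s in scores.items()}
    norm = sum(unnormalized.values())
    return {label: value / norm for label, value in unnormalized.items()}


def classify_rate(frames, gmms=RATE_GMMS):
    """Pick the rate class whose GMM explains the frames best."""
    posteriors = class_posteriors(frames, gmms)
    return max(posteriors, key=posteriors.get)
```

In a two-stage setup of the kind the statements describe, the dictionary returned by `class_posteriors` (or the raw log-likelihoods) would serve as the input vector to the second-stage model.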

A mode-shape classification technique for robust speech rate estimation and syllable nuclei detection

Yarra, Deshmukh, Ghosh. 2016. Speech Communication


Acoustic feature based speech (syllable) rate estimation and syllable nuclei detection are important problems in automatic speech recognition (ASR), computer assisted language learning (CALL) and fluency analysis. A typical solution for both problems consists of two stages. The first stage involves computing a short-time feature contour such that most of the peaks of the contour correspond to the syllabic nuclei. In the second stage, the peaks corresponding to the syllable nuclei are detected. In this work, instead of the peak detection, we perform a mode-shape classification, which is formulated as a supervised binary classification problem: mode-shapes representing the syllabic nuclei as one class and the remaining mode-shapes as the other. We use the temporal correlation and selected sub-band correlation (TCSSBC) feature contour, and the mode-shapes in the TCSSBC feature contour are converted into a set of feature vectors using an interpolation technique. A support vector machine classifier is used for the classification. Experiments are performed separately using Switchboard, TIMIT and CTIMIT corpora in a five-fold cross validation setup. The average correlation coefficients for the syllable rate estimation turn out to be 0.6761, 0.6928 and 0.3604 for the three corpora, respectively, which outperform those obtained by the best of the existing peak detection techniques. Similarly, the average F-scores (syllable level) for the syllable nuclei detection are 0.8917, 0.8200 and 0.7637 for the three corpora, respectively.
