Impact of vocal effort variability on automatic speech recognition

Zelinka, Petr; Sigmund, Milan; Schimmel, Jiří

doi:10.1016/j.specom.2012.01.002

Cited by 49 publications

(18 citation statements)

References 19 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Typically, two main strategies are used to handle the mismatch problem, namely, (1) multiple model recognizer, where dedicated speaker models are obtained for different vocal efforts (e.g., [14]) and (2) multi-style models, where each model is obtained from a combination of normal speech and small amounts of speech of varying vocal efforts [15,14]. Notwithstanding, the two different methods were shown to have their advantages and disadvantages.…”

Section: Introductionmentioning

confidence: 99%

“…Notwithstanding, the two different methods were shown to have their advantages and disadvantages. For example, while both improve the performance of whispered speech [8,14], multiple model training requires significant amounts of whispered speech data to obtain the speaker models, which can be hard to obtain in practice. Multi-style based systems, in turn, despite requiring lower amounts of whispered speech to train the models, trade gains in whispered speech to losses in normal speech accuracy, often by the same amount [14].…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Feature mapping, score-, and feature-level fusion for improved normal and whispered speech speaker verification

Sarria-Paja

Senoussaoui

O’Shaughnessy

et al. 2016

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

In this paper, automatic speaker verification using normal and whispered speech is explored. Typically, for speaker verification systems with varying vocal effort inputs, standard solutions such as feature mapping or addition of data during parameter estimation (training) and enrollment stages result in a trade-off between accuracy gains with whispered test data and accuracy losses (up to 70% in equal error rate, EER) with normal test data. To overcome this shortcoming, this paper proposes two innovations. First, we show the complementarity of features derived from AM-FM models over conventional mel-frequency cepstral coefficients, thus signalling the importance of instantaneous phase information for whispered speech speaker verification. Next, two fusion schemes are explored: score-and feature-level fusion. Overall, we show that gains as high as 30% and 84% in EER can be achieved for normal and whispered speech, respectively, using featurelevel fusion.

show abstract

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Feature mapping, score-, and feature-level fusion for improved normal and whispered speech speaker verification

Sarria-Paja

Senoussaoui

O’Shaughnessy

et al. 2016

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

show abstract

“…3,4 Also, detection of high vocal effort can be applied in speech and speaker recognition in order to tackle a possible mismatch between training and testing conditions. 5,6 For all these technological applications, the performance of human listeners in shout detection serves as a natural point of comparison.…”

Section: Introductionmentioning

confidence: 99%

Detection of shouted speech in noise: Human and machine

Pohjalainen

Raitio

Yrttiaho

et al. 2013

The Journal of the Acoustical Society of America

View full text Add to dashboard Cite

High vocal effort has characteristic acoustic effects on speech. This study focuses on the utilization of this information by human listeners and a machine-based detection system in the task of detecting shouted speech in the presence of noise. Both female and male speakers read Finnish sentences using normal and shouted voice in controlled conditions, with the sound pressure level recorded. The speech material was artificially corrupted by noise and supplemented with pure noise. The human performance level was statistically evaluated by a listening test, where the subjects labeled noisy samples according to whether shouting was heard or not. A Bayesian detection system was constructed and statistically evaluated. Its performance was compared against that of human listeners, substituting different spectrum analysis methods in the feature extraction stage. Using features capable of taking into account the spectral fine structure (i.e., the fundamental frequency and its harmonics), the machine reached the detection level of humans even in the noisiest conditions. In the listening test, male listeners detected shouted speech significantly better than female listeners, especially with speakers making a smaller vocal effort increase for shouting.

show abstract

“…However there are still different research problems that have received little attention, and require more effort to make advances towards the understanding of speech communication. That is the case when there are changes in the vocal effort, which have proven to affect significantly the performance of automatic speech recognition and speaker recognition systems [1,2,3]. Particularly, whispered speech exhibits significant differences with normal phonated speech, being the main physical difference the complete lack of vocal folds vibration.…”

Section: Introductionmentioning

confidence: 99%

“…This strategy can improve significantly the performance of recognition systems, thus allowing for normal and whispered speech to be handled. Nevertheless, different authors suggest that for optimal applications, it is better to have dedicated models for each vocal effort and select the most likely model according to the detected vocal effort [5,3].…”

Section: Introductionmentioning

confidence: 99%

Whispered speaker verification and gender detection using weighted instantaneous frequencies

Sarria-Paja

Falk

O’Shaughnessy

2013

2013 IEEE International Conference on Acoustics, Speech and Signal Processing

View full text Add to dashboard Cite

In this paper, automatic speaker verification and gender detection using whispered speech is explored. Whispered speech, despite its reduced perceptibility, has been shown to convey relevant speaker identity and gender information. This study compares the performance of a GMM-UBM speaker verification system trained with normal and whispered speech under different matched and mismatched conditions, and describes the benefits of adaptation in a speaking-style independent model to handle both vocal efforts. It is shown that performance improvements can be achieved by using speaking-style and gender dependent models, as well as by adding features based on the AM-FM signal representation. Moreover, the AM-FM based features showed to be more discriminative than classical MFCCs for whispered speech gender detection. Experimental results suggest that whispered speech carries sufficient information for reliable automatic speaker identification.

show abstract

Impact of vocal effort variability on automatic speech recognition

Cited by 49 publications

References 19 publications

Feature mapping, score-, and feature-level fusion for improved normal and whispered speech speaker verification

Feature mapping, score-, and feature-level fusion for improved normal and whispered speech speaker verification

Detection of shouted speech in noise: Human and machine

Whispered speaker verification and gender detection using weighted instantaneous frequencies

Contact Info

Product

Resources

About