An Instrumental Intelligibility Metric Based on Information Theory

Kuyk, Steven Van; Kleijn, W. Bastiaan; Hendriks, Richard C.

doi:10.1109/lsp.2017.2774250

Cited by 52 publications

(56 citation statements)

References 44 publications

(67 reference statements)


“…Speech intelligibility in bits (SIIB) is an information theoretic intelligibility metric that was recently proposed in [34]. Similar to MIKNN, a non-parametric mutual information estimator [61] is used to estimate the information shared between a clean and distorted speech signal.…”

Section: K Speech Intelligibility In Bits (mentioning)

confidence: 99%

“…To investigate the effect of decorrelating input features, SIIB and STOI were modified to produce two intelligibility metrics denoted SIIB noKLT and STOI KLT. To compute SIIB noKLT, the implementation of SIIB described in [34] was used, but the KLT was not applied. To compute STOI KLT, three changes are made to the original STOI implementation [22]: 1) Instead of using temporal envelopes to represent speech signals, log-temporal envelopes are used.…”

Section: A Investigating the Effect Of Decorrelating Input Features (mentioning)

confidence: 99%
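The statement above refers to decorrelating input features with the KLT (Karhunen-Loève transform). As a rough illustration only, the sketch below decorrelates a feature matrix by projecting it onto the eigenvectors of its own sample covariance. The function name `klt` and the synthetic data are assumptions made for this example; how the cited SIIB and STOI implementations estimate and apply the transform is not stated here.

```python
import numpy as np


def klt(features):
    """Decorrelate the columns of `features` (frames x dims).

    Hypothetical helper: projects mean-removed features onto the
    eigenvectors of their sample covariance matrix.
    """
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return centered @ eigvecs[:, order]


# Synthetic, strongly correlated features (not speech data).
rng = np.random.default_rng(0)
base = rng.standard_normal((2000, 1))
x = np.hstack([base + 0.1 * rng.standard_normal((2000, 1)) for _ in range(3)])

y = klt(x)
cov_y = np.cov(y, rowvar=False)
off_diag = cov_y - np.diag(np.diag(cov_y))   # zero if fully decorrelated
```

After the transform, the off-diagonal entries of the sample covariance of `y` are zero up to floating-point error, which is the sense in which the features are decorrelated.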


An Evaluation of Intrusive Instrumental Intelligibility Metrics

Kuyk

Kleijn

Hendriks

2018

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite


Instrumental intelligibility metrics are commonly used as an alternative to listening tests. This paper evaluates 12 monaural intrusive intelligibility metrics: SII, HEGP, CSII, HASPI, NCM, QSTI, STOI, ESTOI, MIKNN, SIMI, SIIB, and sEPSM corr. In addition, this paper investigates the ability of intelligibility metrics to generalize to new types of distortions and analyzes why the top-performing metrics have high performance. The intelligibility data were obtained from 11 listening tests described in the literature. The stimuli included Dutch, Danish, and English speech that was distorted by additive noise, reverberation, competing talkers, pre-processing enhancement, and post-processing enhancement. SIIB and HASPI had the highest performance, achieving average correlations with listening test scores of ρ = 0.92 and ρ = 0.89, respectively. The high performance of SIIB may, in part, be the result of SIIB's developers having access to all the intelligibility data considered in the evaluation. The results show that intelligibility metrics tend to perform poorly on data sets that were not used during their development. By modifying the original implementations of SIIB and STOI, the advantage of reducing statistical dependencies between input features is demonstrated. Additionally, the paper presents a new version of SIIB called SIIB Gauss, which has similar performance to SIIB and HASPI but is two orders of magnitude faster to compute.


“…The mapped utterances were initially evaluated using an instrumental intelligibility test called Speech Intelligibility in Bits (SIIB, [31]) using its Gaussian variant (SIIBGauss [31]). Subjective evaluation was then carried out, including an English Intelligibility test and a Finnish Quality test.…”

Section: Discussion (mentioning)

confidence: 99%

“…Objective intelligibility was measured using SIIB [31,34] that is based on the mutual information between a clean reference and a noisy signal (as used in [9]). The test was conducted on the entire English Lombard grid-speech corpus [25] using two different noise types (unstationary factory noise and stationary Volvo noise [35]) at two signal-to-noise ratio (SNR) levels here referred to as moderate and severe.…”

Section: Discussion (mentioning)

confidence: 99%

Augmented CycleGANs for Continuous Scale Normal-to-Lombard Speaking Style Conversion

Seshadri

Juvela

Alku

et al. 2019

Interspeech 2019


Lombard speech is a speaking style associated with increased vocal effort that is naturally used by humans to improve intelligibility in the presence of noise. It is hence desirable to have a system capable of converting speech from normal to Lombard style. Moreover, it would be useful if one could adjust the degree of Lombardness in the converted speech so that the system is more adaptable to different noise environments. In this study, we propose the use of recently developed Augmented cycle-consistent adversarial networks (Augmented CycleGANs) for conversion between normal and Lombard speaking styles. The proposed system gives smooth control over the degree of Lombardness of the mapped utterances by traversing through different points in the latent space of the trained model. We utilize a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract features from normal speech that are then mapped to Lombard-style features using the Augmented CycleGAN. Finally, the mapped features are converted to Lombard speech with PML. The model is trained on multi-language data recorded in different noise conditions, and we compare its effectiveness to a previously proposed CycleGAN system in experiments for intelligibility and quality of mapped speech.


“…To measure the effect of Lombard adaptation on speech intelligibility, a recently developed instrumental intelligibility metric called speech intelligibility in bits (SIIB) [40] was used. SIIB measures the mutual information between a clean reference and a noisy signal.…”

Section: Instrumental Intelligibility Evaluation (mentioning)

confidence: 99%
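The statement above describes SIIB as measuring mutual information between a clean reference and a noisy signal. To illustrate only what "information in bits" means in the simplest case, the sketch below uses the standard closed form for two jointly Gaussian scalar variables, I = -0.5 * log2(1 - ρ²), where ρ is their correlation coefficient. This is not the SIIB algorithm; the function name `gaussian_mi_bits`, the synthetic signals, and the chosen SNR are assumptions for this example.

```python
import numpy as np


def gaussian_mi_bits(x, y):
    """Mutual information in bits, assuming x and y are jointly Gaussian.

    Hypothetical helper: estimates the correlation coefficient and
    applies I = -0.5 * log2(1 - rho^2).
    """
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log2(1.0 - rho ** 2)


# Synthetic white Gaussian "clean" signal plus independent Gaussian noise.
rng = np.random.default_rng(0)
n = 50_000
snr = 3.0                                   # linear power ratio (assumed)
clean = rng.standard_normal(n)
noisy = clean + rng.standard_normal(n) / np.sqrt(snr)

mi = gaussian_mi_bits(clean, noisy)
# For this toy channel the theoretical value is 0.5 * log2(1 + snr) = 1 bit
# per sample; the estimate will be close to, but not exactly, that value.
```

In this toy setting the information shared between the two signals shrinks toward zero as the noise power grows, which is the intuition behind using a mutual-information quantity as an intelligibility indicator.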

Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System

2019


Currently, there is increasing interest in using sequence-to-sequence models with attention, such as Tacotron, in text-to-speech (TTS) synthesis. These models are end-to-end, meaning that they learn both co-articulation and duration properties directly from text and speech. Since these models are entirely data-driven, they need large amounts of data to generate synthetic speech of good quality. However, in challenging speaking styles, such as Lombard speech, it is difficult to record sufficiently large speech corpora. Therefore, we propose a transfer learning method to adapt a TTS system of normal speaking style to Lombard style. We also experiment with a WaveNet vocoder along with a traditional vocoder (WORLD) in the synthesis of Lombard speech. The subjective and objective evaluation results indicated that the proposed adaptation system coupled with the WaveNet vocoder clearly outperformed the conventional deep neural network based TTS system in the synthesis of Lombard speech.
