In this paper, we investigate the effect of temporal correlation on the dependence between the narrowband and highband frequency ranges of speech, covering 0.3-3.4 kHz and 3.7-8 kHz, respectively. We follow the approach of Gaussian mixture modelling of spectral envelopes represented by Mel-frequency cepstral coefficients. The dependence between the disjoint frequency bands is quantified through mutual information (MI) and its ratio to highband entropy. Speech exhibits considerable temporal correlation that is not explicitly accounted for by static parametrization of spectral envelopes. Including memory in the parametrization (through delta features) brings this temporal information into the model; MI gains are therefore expected, which should translate into better bandwidth extension performance. Results show that exploiting delta features can increase certainty about the highband (the ratio of MI to highband entropy) by as much as 216% in relative terms, corresponding to an absolute increase of 12%.
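To make the certainty measure above concrete, the following is a minimal sketch of the ratio of MI to highband entropy. It assumes a small discrete joint probability table over quantized narrowband (X) and highband (Y) classes; the abstract describes Gaussian mixture modelling of continuous features, so the table and the function names `entropy`, `highband_certainty`, and `implied_baseline` are illustrative assumptions rather than the authors' procedure.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def highband_certainty(joint):
    """Ratio I(X;Y) / H(Y) for a joint table joint[i][j] = P(X=i, Y=j).

    Assumption: X and Y are already quantized into discrete classes.
    """
    p_x = [sum(row) for row in joint]            # narrowband marginal
    p_y = [sum(col) for col in zip(*joint)]      # highband marginal
    h_xy = entropy([p for row in joint for p in row])
    mutual_info = entropy(p_x) + entropy(p_y) - h_xy
    return mutual_info / entropy(p_y)

def implied_baseline(relative_gain, absolute_gain):
    """Baseline value implied by a relative gain and its matching absolute gain."""
    return absolute_gain / relative_gain

independent = [[0.25, 0.25], [0.25, 0.25]]   # bands tell nothing about each other
dependent = [[0.5, 0.0], [0.0, 0.5]]         # narrowband fully determines highband
print(highband_certainty(independent), highband_certainty(dependent))
```

Read this way, a 216% relative gain alongside a 12-point absolute gain would imply a baseline certainty of roughly 12 / 2.16 ≈ 5.6%, rising to roughly 17.6%. These derived figures are an inference from the two numbers in the abstract, not values reported in it.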
In this paper, we continue our previous work on improving Bandwidth Extension (BWE) of narrowband speech. We have shown that including memory into the parametrization frontend (through delta features) results in higher highband certainty irrespective of feature type, with MFCCs exhibiting higher correlation, in general, between both bands, reaching twice that using LSFs. By incorporating memory into the frontend of a conventional LP-based BWE system, we were able to translate the higher correlation due to memory into BWE performance improvement. Using high-resolution inverse DCT, we also achieved high quality speech reconstruction from MFCCs, thus enabling MFCC-based BWE with improved performance compared to conventional static LP-based BWE. We continue this work by incorporating the superior correlation properties of frontend memory into our MFCC-based BWE system. Log-Spectral Distortion as well as the more perceptually-correlated Itakura-based measures show that incorporating memory into our MFCC-based BWE system results in BWE performance superior to that of our dynamic LP-based BWE system.
Index Terms: Bandwidth extension, memory inclusion, high-resolution IDCT, highband certainty, mutual information
BACKGROUND
In traditional telephone networks, speech bandwidth is limited to the 0.3-3.4 kHz range. As a result, narrowband speech has sound quality inferior to its wideband counterpart and has reduced intelligibility especially for consonant sounds. Wideband speech reconstruction through Bandwidth Extension (BWE) attempts to regenerate the highband (3.4-7 kHz) signal lost during the filtering processes employed in traditional networks, thereby providing backward compatibility with existing networks. BWE is based on the assumption that narrowband speech correlates with the highband signal, and thus, given some a priori information about the nature of this correlation, the higher frequency speech content can be estimated given only the available narrow band.
Most BWE schemes use either codebook mapping or statistical modelling to perform this estimation. Since BWE performance closely follows the correlation available between representations of the narrow and high frequency bands, the premise of our work has been to quantify this correlation for different speech representations in order to adopt those representations with the greatest potential for BWE performance improvement. In our previous work, first introduced in [1] and later extended in [2], we used the concept of highband certainty (certainty about the high band given the narrow band), defined in [3] as the ratio of the Mutual Information (MI) between the two bands to the discrete entropy of the high band, to quantify the correlation between speech frequency bands. Through highband certainty, we investigated the effect of including memory in the frontend on the resulting correlation (by using delta features in addition to the conventional static features, which make no use of the considerable temporal correlation properties of speech), as ...
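As an illustration of the delta features mentioned in the excerpt above, the following is a minimal sketch of the widely used regression formula for delta coefficients, appended to static feature vectors. The window half-width `N=2`, the edge handling (repeating the first and last frames), and the function names `delta` and `append_deltas` are assumptions for this example; the excerpt does not specify how the authors compute their deltas.

```python
def delta(frames, N=2):
    """Regression-based delta features over a window of +/- N frames.

    frames: list of equal-length feature vectors, one per time frame.
    Assumption: edges are handled by repeating the first/last frame.
    """
    T = len(frames)
    dims = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(dims):
            num = 0.0
            for n in range(1, N + 1):
                ahead = frames[min(t + n, T - 1)][d]
                behind = frames[max(t - n, 0)][d]
                num += n * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out

def append_deltas(frames, N=2):
    """Concatenate each static vector with its delta vector."""
    return [static + dyn for static, dyn in zip(frames, delta(frames, N))]

# A linear ramp: away from the edges, the delta equals the per-frame slope.
ramp = [[float(t), 2.0 * t] for t in range(7)]
print(append_deltas(ramp)[3])
```

For a sequence that changes linearly over time, the interior deltas equal the slope, which is the sense in which they carry the temporal information that static features alone discard.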
Estimating the quality of speech without the use of a clean reference signal is a challenging problem, in part due to the time and expense required to collect sufficient training data for modern machine learning algorithms. We present a novel, non-intrusive estimator that exploits recurrent neural network architectures to predict the intrusive POLQA score of a speech signal in a short time context. The predictor is based on a novel compressed representation of modulation domain features, used in conjunction with static MFCC features. We show that the proposed method can reliably predict POLQA with a 300 ms context, achieving a mean absolute error of 0.21 on unseen data. The proposed method is trained using English speech and is shown to generalize well across unseen languages. The neural network also jointly estimates the mean voice activity detection (VAD) with an F1 accuracy score of 0.9, removing the need for an external VAD.
We present a novel MFCC-based scheme for the Bandwidth Extension (BWE) of narrowband speech. BWE is based on the assumption that narrowband speech (0.3-3.4 kHz) correlates closely with the highband signal (3.4-7 kHz), enabling estimation of the highband frequency content given the narrow band. While BWE schemes have traditionally used LP-based parametrizations, our recent work has shown that MFCC parametrization results in higher correlation between both bands reaching twice that using LSFs. By employing high-resolution IDCT of highband MFCCs obtained from narrowband MFCCs by statistical estimation, we achieve high-quality highband power spectra from which the time-domain speech signal can be reconstructed. Implementing this scheme for BWE translates the higher correlation advantage of MFCCs into BWE performance superior to that obtained using LSFs, as shown by improvements in log-spectral distortion as well as Itakura-based measures (the latter improving by up to 13%).
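The high-resolution IDCT idea above can be illustrated with a minimal sketch: cepstral coefficients produced by an M-point orthonormal DCT-II are treated as a cosine series and evaluated on a denser grid than the original M channels. This is a generic formulation under stated assumptions (orthonormal DCT-II, a uniform grid on the warped frequency axis, hypothetical function names `dct2` and `idct_highres`); the abstract does not give the authors' exact procedure, and Mel warping and filterbank details are omitted.

```python
import math

def _scale(k, M):
    """Orthonormal DCT-II scaling factor for coefficient index k."""
    return math.sqrt(1.0 / M) if k == 0 else math.sqrt(2.0 / M)

def dct2(x):
    """Orthonormal DCT-II of a sequence (e.g. log filterbank energies)."""
    M = len(x)
    return [
        _scale(k, M)
        * sum(x[m] * math.cos(math.pi * k * (m + 0.5) / M) for m in range(M))
        for k in range(M)
    ]

def idct_highres(coeffs, M, n_points):
    """Evaluate the inverse cosine series at n_points >= M positions.

    coeffs: leading coefficients of an M-point orthonormal DCT-II.
    Point j sits at u = (j + 0.5) / n_points on the normalized axis [0, 1].
    """
    out = []
    for j in range(n_points):
        u = (j + 0.5) / n_points
        out.append(
            sum(_scale(k, M) * c * math.cos(math.pi * k * u)
                for k, c in enumerate(coeffs))
        )
    return out

log_energies = [1.0, 2.0, 0.5, -1.0, 3.0]          # illustrative values only
cepstrum = dct2(log_energies)
fine = idct_highres(cepstrum, M=5, n_points=15)    # 3x denser log-spectrum
```

With a threefold denser grid, every third output sample (indices 1, 4, 7, ...) falls exactly on an original channel centre and reproduces the input, while the samples in between are interpolated by the cosine series.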