Normalized amplitude modulation features for large vocabulary noise-robust speech recognition

Mitra, Vikramjit; Franco, Horacio; Graciarena, Martin; Mandal, Arindam

doi:10.1109/icassp.2012.6288824

Cited by 77 publications

(37 citation statements)

References 14 publications


“…For improving robustness, the normalized modulation spectra have been proposed in [23]. Similar work in the context of large vocabulary speech recognition such as noisy Wall Street Journal (New York, NY, USA) and GALE task as reported in [24,25].…”

Section: Related Work (mentioning)

confidence: 88%

Auditory processing-based features for improving speech recognition in adverse acoustic conditions

Maganti

Matassoni

2014

J AUDIO SPEECH MUSIC PROC.


The paper describes an auditory processing-based feature extraction strategy for robust speech recognition in environments where conventional automatic speech recognition (ASR) approaches are not successful. It combines gammatone filtering, modulation spectrum and non-linearity in the feature extraction stage of the recognition chain to improve ASR robustness in adverse acoustic conditions. Experimental results on the standard Aurora-4 large vocabulary evaluation task show that the proposed features provide reliable and considerable improvement in robustness across different noise conditions and perform comparably to standard feature extraction techniques.


“…The MFCCs were also augmented with a 10-dimensional voicing feature vector [12]. The three novel features explored were: (1) The Normalized Modulation Cepstral Coefficient (NMCC) [13], obtained from tracking the amplitude modulations of the sub-band speech signals in time domain. The produced 52-dimensional vector was reduced to 20 with principal component analysis (PCA) (NMCC20).…”

Section: SRI ASR Systems (mentioning)

confidence: 99%
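The statement above says a 52-dimensional NMCC vector was reduced to 20 dimensions with PCA. The following is a minimal sketch of that kind of reduction, not the authors' implementation: the function names `fit_pca` and `apply_pca` are hypothetical, the frames are random placeholder data rather than real NMCC features, and only the 52 and 20 dimensions come from the quoted text.

```python
import numpy as np


def fit_pca(features, n_components):
    """Fit PCA on row-wise feature frames; return (mean, projection matrix)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components].T


def apply_pca(features, mean, projection):
    """Project frames onto the retained principal directions."""
    return (features - mean) @ projection


# Placeholder data (assumption): 500 frames of 52-dimensional features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((500, 52))

mean, projection = fit_pca(frames, n_components=20)
reduced = apply_pca(frames, mean, projection)
print(reduced.shape)
```

In practice the projection would be estimated on training data and then applied unchanged to test data, which is why fitting and applying are kept as separate functions here.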

Calibration and multiple system fusion for spoken term detection using linear logistic regression

Hout

Ferrer

Vergyri

et al. 2014

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


State-of-the-art calibration and fusion approaches for spoken term detection (STD) systems currently rely on a multi-pass approach where the scores are calibrated, then fused, and finally re-calibrated to obtain a single decision threshold across keywords. While the above techniques are theoretically correct, they rely on metaparameter tuning and are prone to over-fitting. This study presents an efficient and effective score calibration technique for keyword detection that is based on the logistic regression calibration approach commonly used in forensic speaker identification. The technique applies seamlessly to both single systems and to system fusion, and enables optimization for specific keyword detection evaluation functions. We run experiments on a Vietnamese STD task, comparing the technique with more empirical calibration and fusion schemes and demonstrate that we can achieve comparable or better performance in terms of the NIST ATWV metric with a more elegant solution.
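The abstract above describes calibrating and fusing detection scores with linear logistic regression. As a rough illustration of the general idea only, the sketch below learns one weight per system and a bias so that the fused score behaves like a log-odds value. Everything in it is an assumption for illustration: the function names `train_fusion` and `fuse`, the plain gradient-descent training, and the synthetic two-system scores. It does not reproduce the paper's optimization for keyword detection metrics such as ATWV.

```python
import numpy as np


def train_fusion(scores, labels, lr=0.1, n_iter=2000):
    """Fit weights w and bias b by minimizing mean cross-entropy
    with batch gradient descent (illustrative training choice)."""
    n, k = scores.shape
    w = np.zeros(k)
    b = 0.0
    for _ in range(n_iter):
        logits = scores @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = probs - labels  # gradient of cross-entropy w.r.t. logits
        w -= lr * (scores.T @ err) / n
        b -= lr * err.mean()
    return w, b


def fuse(scores, w, b):
    """Return fused, calibrated log-odds scores."""
    return scores @ w + b


# Synthetic data (assumption): two hypothetical systems whose raw scores
# are higher on average for true detections (label 1) than for false ones.
rng = np.random.default_rng(0)
labels = (rng.random(400) < 0.5).astype(float)
scores = np.column_stack([
    2.0 * labels + rng.standard_normal(400),
    1.5 * labels + rng.standard_normal(400),
])

w, b = train_fusion(scores, labels)
fused = fuse(scores, w, b)
decisions = fused > 0.0  # single threshold at log-odds 0
accuracy = float((decisions == (labels == 1.0)).mean())
```

Because the output is a log-odds score, one threshold can be applied to all fused scores, and the same code covers the single-system case when `scores` has one column.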


“…followed by cepstral feature extraction; or (2) by using noise robust speech-processing approaches, where noise-robust transforms and/or human perception based speech analysis methodologies are deployed for acoustic-feature generation (e.g., ETSI [European Telecomm. Standards Institute] advanced frontend [4], power normalized cepstral coefficients [PNCC] [5], modulation based features [6,7], and several others).…”

Section: Introduction (mentioning)

confidence: 99%

Medium-duration modulation cepstral feature for robust speech recognition

Mitra

Franco

Graciarena

et al. 2014

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Studies have shown that the performance of state-of-the-art automatic speech recognition (ASR) systems deteriorates significantly with increased noise levels and channel degradations, when compared to human speech recognition capability. Traditionally, noise-robust acoustic features are deployed to improve speech recognition performance under varying background conditions and to compensate for these degradations. In this paper, we present the Modulation of Medium Duration Speech Amplitude (MMeDuSA) feature, which is a composite feature capturing subband speech modulations and a summary modulation. We analyze MMeDuSA's speech recognition performance using SRI International's DECIPHER® large vocabulary continuous speech recognition (LVCSR) system, on noise and channel degraded Levantine Arabic speech distributed through the Defense Advanced Research Projects Agency (DARPA) Robust Automatic Speech Transcription (RATS) program. We also analyzed MMeDuSA's performance on the Aurora-4 noise-and-channel degraded English corpus. Our results from all these experiments suggest that the proposed MMeDuSA feature improved recognition performance under both noisy and channel degraded conditions in almost all the recognition tasks.
