Abstract: In the presence of environmental noise, speakers tend to adjust their speech production in an effort to preserve intelligible communication. The noise-induced speech adjustments, called Lombard effect (LE), are known to severely impact the accuracy of automatic speech recognition (ASR) systems. The reduced performance results from the mismatch between the ASR acoustic models trained typically on noise-clean neutral (modal) speech and the actual parameters of noisy LE speech. In this study, novel unsupervised frequency domain and cepstral domain equalizations that increase ASR resistance to LE are proposed and incorporated in a recognition scheme employing a codebook of noisy acoustic models. In the frequency domain, short-time speech spectra are transformed towards neutral ASR acoustic models in a maximum-likelihood fashion. Simultaneously, dynamics of cepstral samples are determined from the quantile estimates and normalized to a constant range. A codebook decoding strategy is applied to determine the noisy models best matching the actual mixture of speech and noisy background. The proposed algorithms are evaluated side by side with conventional compensation schemes on connected Czech digits presented in various levels of background car noise. The resulting system provides an absolute word error rate (WER) reduction on 10-dB signal-to-noise ratio data of 8.7% and 37.7% for female neutral and LE speech, respectively, and of 8.7% and 32.8% for male neutral and LE speech, respectively, when compared to the baseline recognizer employing perceptual linear prediction (PLP) coefficients and cepstral mean and variance normalization.
Index Terms: Cepstral compensation, codebook of noisy models, frequency warping, Lombard effect, speech recognition.
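The cepstral equalization described above determines the dynamics of each cepstral dimension from quantile estimates and maps them to a constant range. A minimal sketch of that idea, assuming 5th/95th percentiles as the quantile pair and a unit target range (the paper's exact quantile choices and scaling are not specified here):

```python
import numpy as np

def quantile_range_normalize(C, q_lo=0.05, q_hi=0.95, target=1.0):
    """Scale each cepstral dimension so its inter-quantile range is constant.

    C: (n_frames, n_ceps) matrix of cepstral coefficients.
    Returns a matrix whose (q_lo, q_hi) quantile span equals `target`
    in every dimension, reducing talking-style-induced dynamic-range
    mismatch between utterances.
    """
    lo = np.quantile(C, q_lo, axis=0)
    hi = np.quantile(C, q_hi, axis=0)
    span = np.maximum(hi - lo, 1e-8)  # guard against flat dimensions
    return (C - lo) / span * target

# toy cepstral matrix: 200 frames x 13 coefficients with growing variance
rng = np.random.default_rng(0)
C = rng.normal(scale=np.arange(1, 14), size=(200, 13))
Cn = quantile_range_normalize(C)
```

Unlike plain variance normalization, the quantile span is insensitive to a small fraction of outlier frames (e.g. plosive bursts), which is the usual motivation for quantile-based dynamics estimates.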
Secondary tasks such as cell phone calls or interaction with automated speech dialog systems (SDSs) increase the driver’s cognitive load as well as the probability of driving errors. This study analyzes speech production variations due to cognitive load and emotional state of drivers in real driving conditions. Speech samples were acquired from 24 female and 17 male subjects (approximately 8.5 h of data) while talking to a co-driver and communicating with two automated call centers, with emotional states (neutral, negative) and the number of necessary SDS query repetitions also labeled. A consistent shift in a number of speech production parameters (pitch, first formant center frequency, spectral center of gravity, spectral energy spread, and duration of voiced segments) was observed when comparing SDS interaction against co-driver interaction; further increases were observed when considering negative emotion segments and the number of requested SDS query repetitions. A mel-frequency cepstral coefficient (MFCC) based Gaussian mixture classifier trained on 10 male and 10 female sessions provided 91% accuracy in the open test set task of distinguishing co-driver interactions from SDS interactions, suggesting—together with the acoustic analysis—that it is possible to monitor the level of driver distraction directly from their speech.
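The interaction classifier above models MFCC frames of each interaction style with a Gaussian mixture and picks the class with the higher likelihood. A minimal numpy-only sketch of that decision rule, reduced to one diagonal-covariance Gaussian per class and synthetic features standing in for real MFCCs (the paper's mixture sizes and features are not reproduced):

```python
import numpy as np

class DiagGaussianClassifier:
    """One diagonal-covariance Gaussian per class; a stripped-down
    stand-in for a per-class Gaussian mixture model (GMM) classifier."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        self.var = np.stack([X[y == c].var(axis=0) + 1e-6 for c in self.classes_])
        return self

    def predict(self, X):
        # per-class diagonal-Gaussian log-likelihood of each frame
        ll = -0.5 * (((X[:, None, :] - self.mu) ** 2) / self.var
                     + np.log(2.0 * np.pi * self.var)).sum(axis=2)
        return self.classes_[ll.argmax(axis=1)]

# synthetic 13-dimensional "MFCC" frames for two interaction styles
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, size=(300, 13))   # co-driver-like frames
X1 = rng.normal(3.0, 1.0, size=(300, 13))   # SDS-like frames
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 300)

clf = DiagGaussianClassifier().fit(X, y)
acc = (clf.predict(X) == y).mean()
```

In practice the per-frame scores would be accumulated over a whole utterance before deciding, and each class would use a multi-component mixture rather than a single Gaussian.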
In this paper, we propose initial steps towards multi-modal driver stress (distraction) detection in urban driving scenarios involving multi-tasking, dialog system conversation, and medium-level cognitive tasks. The goal is to obtain a continuous operation-mode detection employing driver's speech and CAN-Bus signals, with a direct application for an intelligent human-vehicle interface which will adapt to the actual state of the driver. First, the impact of various driving scenarios on speech production features is analyzed, followed by a design of a speech-based stress detector. In the driver-/maneuver-independent open test set task, the system reaches 88.2% accuracy in neutral/stress classification. Second, distraction detection exploiting CAN-Bus signals is introduced and evaluated in a driver-/maneuver-dependent closed test set task, reaching 98% and 84% distraction detection accuracy in lane keeping segments and curve negotiation segments, respectively. Performance of the autonomous classifiers suggests that future fusion of speech and CAN-Bus signal domains will yield an overall robust stress assessment framework.
This study focuses on acoustic variations in speech introduced by whispering, and proposes several strategies to improve robustness of automatic speech recognition of whispered speech with neutral-trained acoustic models. In the analysis part, differences in neutral and whispered speech captured in the UT-Vocal Effort II corpus are studied in terms of energy, spectral slope, and formant center frequency and bandwidth distributions in silence, voiced, and unvoiced speech signal segments. In the part dedicated to speech recognition, several strategies involving front-end filter bank redistribution, cepstral dimensionality reduction, and lexicon expansion for alternative pronunciations are proposed. The proposed neutral-trained system employing redistributed filter bank and reduced features provides a 7.7% absolute WER reduction over the baseline system trained on neutral speech, and a 1.3% reduction over a baseline system with whisper-adapted acoustic models.
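The filter bank redistribution mentioned above reshapes where the front-end places its spectral resolution. A sketch of a standard mel-spaced triangular filter bank in which raising the lower edge frequency crudely "redistributes" filters away from the low band, where whisper carries little voicing energy (an illustrative knob, not the paper's actual redistribution scheme; all parameter values below are assumptions):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(n_filters, n_fft, sr, f_lo=0.0, f_hi=None):
    """Build (n_filters, n_fft//2 + 1) triangular filters with
    mel-spaced center frequencies between f_lo and f_hi."""
    f_hi = f_hi or sr / 2.0
    mels = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):          # rising slope
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling slope
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

# baseline bank vs. a bank redistributed away from the low band
fb_base = triangular_filterbank(20, 512, 16000)
fb_shift = triangular_filterbank(20, 512, 16000, f_lo=300.0)
```

Comparing `fb_base` and `fb_shift` shows the same number of filters covering a narrower band, i.e. finer resolution in the region retained.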
Adverse environments impact the performance of automatic speech recognition systems in two ways: directly, by introducing acoustic mismatch between the speech signal and acoustic models, and indirectly, by affecting the way speakers communicate to maintain intelligible communication over noise (Lombard effect). A growing number of studies have analyzed Lombard effect with respect to speech production and perception, yet limited attention has been paid to its impact on speech systems, especially within a larger vocabulary context. This study presents large vocabulary speech material captured in the recently acquired portion of the UT-Scope database, produced in several types and levels of simulated background noise (highway, crowd, pink). The impact of noisy background variations on speech parameters is studied together with the effects on automatic speech recognition. Front-end cepstral normalization utilizing a modified RASTA filter is proposed and shown to improve recognition performance in a side-by-side evaluation with several common and state-of-the-art normalization algorithms.
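RASTA-style normalization band-pass filters each cepstral (or log-spectral) coefficient trajectory over time, suppressing the very slow modulations contributed by a stationary channel or noise floor. The paper's modification is not reproduced here; the sketch below implements the classical RASTA transfer function H(z) = 0.1(2 + z^-1 - z^-3 - 2z^-4) / (1 - 0.98 z^-1), whose zero gain at DC removes any constant offset:

```python
import numpy as np

# classical RASTA band-pass coefficients (Hermansky & Morgan)
B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR numerator
A_POLE = 0.98                                     # IIR pole

def rasta_filter(traj):
    """Apply the RASTA band-pass to one coefficient trajectory.

    Because sum(B) == 0, the filter has zero DC gain: a constant
    additive offset (e.g. a fixed channel bias in the log domain)
    is driven towards zero over time.
    """
    y = np.zeros(len(traj), dtype=float)
    for n in range(len(traj)):
        acc = sum(B[k] * traj[n - k] for k in range(5) if n - k >= 0)
        y[n] = acc + (A_POLE * y[n - 1] if n > 0 else 0.0)
    return y

# a constant trajectory (pure stationary offset) decays towards zero
out = rasta_filter(np.ones(1000))
```

In a full front-end this filter runs independently on every cepstral dimension before the features reach the recognizer.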