Voice Activity Detection. Fundamentals and Speech Recognition System Robustness

Ramírez, Javier; Górriz, J. M.; Segura, José C.

doi:10.5772/4740

Cited by 186 publications

(109 citation statements)

References 37 publications

Supporting

Mentioning

106

Contrasting

Unclassified


“…Detection performance as a function of the SNR [7] was assessed in terms of the non-speech hit-rate (HR0) and the speech hit-rate (HR1). Most of the VAD algorithms [4] fail when the noise level increases and the noise completely mask the speech signal. A VAD module is used in the speech recognition systems within the feature extraction process.…”

Section: Voice Activity Detector (VAD) (mentioning)

confidence: 99%
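The two hit-rates named in the statement above can be illustrated with a short sketch. It assumes the usual frame-level reading: HR0 is the fraction of reference non-speech frames that the detector labels as non-speech, and HR1 is the fraction of reference speech frames that it labels as speech. The function name `hit_rates` and the 0/1 label convention are illustrative choices, not taken from the cited work.

```python
def hit_rates(reference, decision):
    """Return (HR0, HR1) for frame-level labels (1 = speech, 0 = non-speech).

    Assumed definitions (not quoted from the source):
      HR0 = correctly detected non-speech frames / reference non-speech frames
      HR1 = correctly detected speech frames / reference speech frames
    """
    if len(reference) != len(decision):
        raise ValueError("label sequences must have the same length")

    n0 = sum(1 for r in reference if r == 0)
    n1 = sum(1 for r in reference if r == 1)
    hit0 = sum(1 for r, d in zip(reference, decision) if r == 0 and d == 0)
    hit1 = sum(1 for r, d in zip(reference, decision) if r == 1 and d == 1)

    # A hit-rate is undefined when the reference holds no frames of that class.
    hr0 = hit0 / n0 if n0 else float("nan")
    hr1 = hit1 / n1 if n1 else float("nan")
    return hr0, hr1


# Hypothetical labels: one noise frame is wrongly flagged as speech.
reference = [0, 0, 0, 1, 1, 1, 1, 0]
decision = [0, 1, 0, 1, 1, 1, 1, 0]
hr0, hr1 = hit_rates(reference, decision)  # (0.75, 1.0)
```

Reporting both rates at each SNR, rather than a single accuracy figure, shows whether a detector fails by clipping speech (low HR1) or by passing noise as speech (low HR0).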

“…If the noise estimate is too high, speech will be distorted resulting possibly in eligibility loss. The simplest approach is to estimate and update the noise spectrum during the silent (pauses) segments of the signal using a voice-activity detection (VAD) [4]. An approach might work satisfactorily in stationary noise, it will not work well in more realistic environments where the spectral characteristics of the noise might be changing constantly.…”

Section: Noise Estimation Algorithms (mentioning)

confidence: 99%
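The VAD-gated noise update described in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited work: the energy-threshold detector, the smoothing factor `alpha`, the threshold `threshold_db`, and all function names are hypothetical.

```python
import math


def is_speech_frame(frame_psd, noise_psd, threshold_db=6.0):
    """Toy energy VAD (an assumption): speech if the frame exceeds the noise by threshold_db."""
    frame_energy = max(sum(frame_psd), 1e-12)
    noise_energy = max(sum(noise_psd), 1e-12)
    return 10.0 * math.log10(frame_energy / noise_energy) > threshold_db


def update_noise(noise_psd, frame_psd, speech, alpha=0.9):
    """Recursively average the noise spectrum during pauses; hold it during speech."""
    if speech:
        return list(noise_psd)
    return [alpha * n + (1.0 - alpha) * x for n, x in zip(noise_psd, frame_psd)]


def track_noise(frames, initial_noise_psd):
    """Run the gated update over a sequence of per-frame power spectra."""
    noise = list(initial_noise_psd)
    for frame in frames:
        noise = update_noise(noise, frame, is_speech_frame(frame, noise))
    return noise


# Hypothetical two-bin spectra: a quiet frame, then a loud (speech-like) frame.
frames = [[1.2, 0.8], [10.0, 12.0]]
noise = track_noise(frames, [1.0, 1.0])  # approximately [1.02, 0.98]
```

Because the estimate is frozen while speech is detected, it cannot follow noise that changes during an utterance, which is the limitation the statement points out for non-stationary environments.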


Noise Estimation and Noise Removal Techniques for Speech Recognition in Adverse Environment

Shrawankar, Thakare (2010). Intelligent Information Processing V


Abstract. Noise is ubiquitous in almost all acoustic environments. The speech signal, that is recorded by a microphone is generally infected by noise originating from various sources. Such contamination can change the characteristics of the speech signals and degrade the speech quality and intelligibility, thereby causing significant harm to human-to-machine communication systems. Noise detection and reduction for speech applications is often formulated as a digital filtering problem, where the clean speech estimation is obtained by passing the noisy speech through a linear filter. With such a formulation, the core issue of noise reduction becomes how to design an optimal filter that can significantly suppress noise without noticeable speech distortion. This paper focuses on voice activity detection, noise estimation, removal techniques and an optimal filter.


“…Often, a voice activity detector (VAD) [38,39] is used to detect the speech and non-speech segments in the noisy signal and, then, noise is estimated from the latter segments. Other traditional noise estimation methods are based on tracking spectral minima in each frequency band [29], MMSE-based spectral tracking [21] or comb-filtering [30].…”

Section: Noise Model Estimation (mentioning)

confidence: 99%

Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition

González, Gómez, Peinado et al. (2017). Circuits Syst Signal Process


An effective way to increase noise robustness in automatic speech recognition (ASR) systems is feature enhancement based on an analytical distortion model that describes the effects of noise on the speech features. One of such distortion models that has been reported to achieve a good tradeoff between accuracy and simplicity is the masking model. Under this model, speech distortion caused by environmental noise is seen as a spectral mask and, as a result, noisy speech features can be either reliable (speech is not masked by noise) or unreliable (speech is masked). In this paper we present a detailed overview of this model and its applications to noise-robust ASR. Firstly, using the masking model, we derive a spectral reconstruction technique aimed at enhancing the noisy speech features. Two problems must be solved in order to perform spectral reconstruction using the masking model: i) mask estimation, i.e. determining the reliability of the noisy features, and ii) feature imputation, i.e. estimating speech for the unreliable features. Unlike missing-data imputation techniques where the two problems are considered as independent, our technique jointly addresses them by exploiting a priori knowledge of the speech and noise sources in the form of a statistical model. Secondly, we propose an algorithm for estimating the noise model required by the feature enhancement technique. The proposed algorithm fits a Gaussian mixture model (GMM) to the noise by iteratively maximising the likelihood of the noisy speech signal so that noise can be estimated even during speech-dominating frames. A comprehensive set of experiments carried out on the Aurora-2 and Aurora-4 databases shows that the proposed method achieves significant improvements over the baseline system and other similar missing-data imputation techniques.


“…VAD, also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected [18]. The main applications of VAD are in speech coding, speech recognition and speech searching [25].…”

Section: Voice Activity Detector (mentioning)

confidence: 99%

Monitoring of audio visual quality by key indicators

Fernández, Leszczuk (2017). Multimed Tools Appl


Over 10 billion hours of video are watched online every month. Together with high definition television broadcasting and the rise in high quality video on demand, this makes quality assessment a key task in the global multimedia market. Automating quality checking is currently based on finding major audiovisual artefacts. The Monitoring Of Audio Visual quality by key Indicators (MOAVI) subgroup of the Video Quality Experts Group (VQEG) is an open collaborative project for developing No-Reference models for monitoring audiovisual service quality. The purpose of this paper is to report on the development of the audiovisual part of this project, which includes the detection of muting, clipping and lip synchronization (also known as lip sync) artefacts.

