2012
DOI: 10.1016/j.compeleceng.2012.09.003
Voice activity detection algorithm using nonlinear spectral weights, hangover and hangbefore criteria

Cited by 11 publications (5 citation statements)
References 22 publications
“…For example, in [50], a decision-tree algorithm that combines the scores of HMM-based speech/non-speech models with speech pulse information was used to reject far-field speech in speech recognition systems. Both [21,52] and [50] use statistical models to characterize speech and non-speech signals, with decision logic governing the switching between speech and non-speech states. The difference is that in the GMM-VAD of [21], state duration is governed by the number of speech frames (as detected by the GMMs) in a fixed-length buffer, and in the GMM-VAD of [52], state duration is governed by a hangover and hangbefore scheme that detects consonants occurring at the beginning, middle and end of words; whereas in the HMM-VAD of [50], state duration is controlled by the state-transition probabilities of the HMMs and by speech pulse information.…”
Section: Introduction
confidence: 99%
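The hangover/hangbefore idea described in the statement above can be sketched as a post-processing pass over frame-wise speech/non-speech decisions: frames shortly before a detected speech onset and shortly after a speech offset are relabelled as speech, so that weak word-initial and word-final consonants are not clipped. This is a minimal illustrative sketch under assumed parameter values, not the exact scheme of [52]; the function name and defaults are invented for illustration.

```python
def smooth_vad(raw, hangbefore=3, hangover=5):
    """Apply hangbefore/hangover smoothing to raw per-frame VAD decisions.

    raw        -- list of 0/1 decisions from the frame-level detector
    hangbefore -- frames before each speech onset relabelled as speech
                  (catches weak word-initial consonants)
    hangover   -- frames after each speech offset kept as speech
                  (catches weak word-final consonants)
    """
    n = len(raw)
    out = list(raw)
    for i, v in enumerate(raw):
        if v == 1:
            # hangbefore: extend the speech region backwards
            for j in range(max(0, i - hangbefore), i):
                out[j] = 1
            # hangover: extend the speech region forwards
            for j in range(i + 1, min(n, i + 1 + hangover)):
                out[j] = 1
    return out
```

For example, an isolated two-frame speech burst is widened on both sides, turning a raw decision sequence like `0 0 0 0 1 1 0 0 0 0 0 0` into a single contiguous speech segment.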
“…The closest one is the VFR VAD [3], which is our previous work and also uses the a posteriori SNR-weighted energy distance as the feature for the VAD decision. The GMM-NLSM VAD [27] provides good performance as well, but still with a 3% (absolute) higher FER than rVAD; it should also be noted that GMM-NLSM is a supervised VAD whose GMMs are trained on the multicondition training data of the Aurora 2 database. Next in line is the VAD method in the DSR AFE front-end [25], which is unsupervised and gives a more than 5% (absolute) higher FER than rVAD.…”
Section: Comparison With Referenced Methods and Evaluation Of Differe…
confidence: 95%
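The a posteriori SNR-weighted energy distance mentioned in the statement above can be illustrated loosely: per-frame energy distances are emphasised where the a posteriori SNR (frame energy over a noise-energy estimate) is high, so speech-dominated frames contribute more to the decision feature. This is only a rough sketch of the idea, not the exact rVAD/VFR definition; the function name and weighting choice are assumptions.

```python
import numpy as np

def snr_weighted_energy_distance(frames, noise_energy):
    """Sketch of an a posteriori SNR-weighted energy distance feature.

    frames       -- 2-D array, one windowed signal frame per row
    noise_energy -- scalar noise-energy estimate (e.g. from leading
                    non-speech frames); assumed given here
    """
    energy = np.sum(frames ** 2, axis=1)           # per-frame energy
    post_snr = energy / max(noise_energy, 1e-12)   # a posteriori SNR
    # energy distance between consecutive frames, weighted by the SNR
    dist = np.abs(np.diff(energy, prepend=energy[0]))
    return post_snr * dist
```

The weighted feature is then typically thresholded (with smoothing such as the hangover/hangbefore pass) to obtain frame-wise speech/non-speech decisions.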
“…The comparison in this table is conducted in terms of frame error rate (FER), since results for LTSV and GMM-NLSM are available only in terms of FER. Note that identical experimental settings and labels are used across [3,17,27] and the present work, so the comparison is valid.…”
Section: Comparison With Referenced Methods and Evaluation Of Differe…
confidence: 99%
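The frame error rate used in the comparison above is simply the fraction of frames whose speech/non-speech label disagrees with the reference annotation. A minimal sketch (function name assumed):

```python
def frame_error_rate(predicted, reference):
    """Fraction of frames whose 0/1 speech label disagrees with the
    reference labels; predicted and reference must be equal length."""
    assert len(predicted) == len(reference) and len(reference) > 0
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)
```

An absolute FER difference, as quoted in the statements above, is just the difference between two such fractions computed on the same reference labels.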