ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414445

Speech Enhancement Aided End-To-End Multi-Task Learning for Voice Activity Detection

Abstract: Robust voice activity detection (VAD) is a challenging task in low signal-to-noise ratio (SNR) environments. Recent studies show that speech enhancement is helpful to VAD, but the performance improvement is limited. To address this issue, here we propose a speech enhancement aided end-to-end multi-task model for VAD. The model has two decoders, one for speech enhancement and the other for VAD. The two decoders share the same encoder and speech separation network. Unlike the direct thought that takes two separated ob…
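
The shared-encoder, two-decoder layout described in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical toy stand-in, not the paper's networks: the function names (`encode`, `separate`, `decode_enhanced`, `decode_vad`, `forward`), the framing, the energy-based mask, and the sigmoid parameters are all assumptions chosen only to show how one shared front end can feed two task-specific heads. The training objective is not sketched, because the abstract is truncated before it is described.

```python
import math
from typing import List, Tuple

Frame = List[float]


def encode(waveform: List[float], frame_len: int = 4) -> List[Frame]:
    """Shared encoder (toy): split the waveform into non-overlapping frames."""
    n_frames = len(waveform) // frame_len
    return [waveform[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def separate(frames: List[Frame]) -> List[Frame]:
    """Shared separation network (toy): scale each frame by a soft mask
    equal to its energy relative to the most energetic frame."""
    energies = [sum(x * x for x in f) / len(f) for f in frames]
    peak = max(energies) if energies and max(energies) > 0 else 1.0
    return [[x * (e / peak) for x in f] for f, e in zip(frames, energies)]


def decode_enhanced(features: List[Frame]) -> List[float]:
    """Enhancement decoder (toy): concatenate masked frames into a waveform."""
    return [x for f in features for x in f]


def decode_vad(features: List[Frame], threshold: float = 0.05,
               sharpness: float = 50.0) -> List[float]:
    """VAD decoder (toy): map per-frame energy to a speech probability
    with a sigmoid; threshold and sharpness are arbitrary assumptions."""
    probs = []
    for f in features:
        energy = sum(x * x for x in f) / len(f)
        probs.append(1.0 / (1.0 + math.exp(-sharpness * (energy - threshold))))
    return probs


def forward(waveform: List[float]) -> Tuple[List[float], List[float]]:
    """Run the shared front end once, then both decoders on its output."""
    shared = separate(encode(waveform))
    return decode_enhanced(shared), decode_vad(shared)


# Usage: a silent frame, a loud frame, and a near-silent frame.
signal = [0.0] * 4 + [1.0, -1.0, 1.0, -1.0] + [0.01] * 4
enhanced, vad = forward(signal)
```

The point of the layout is that `encode` and `separate` are computed once and reused by both heads, so whatever the shared front end learns for enhancement is also available to the VAD decision.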

Cited by 35 publications (10 citation statements)
References 24 publications
“…A diffuse noise generator 2 is used to simulate uncorrelated diffuse noise. The noise source for training and development is a large-scale noise library containing over 20000 noise segments [28], and the noise source for test is the noise segments from CHiME-3 dataset [29] and NOISEX-92 corpus [30]. We randomly generate 20 channels for training and development, and 20, 30 and 40 channels respectively for test.…”
Section: Experiments, A. Dataset
Citation type: mentioning
confidence: 99%
“…This method can be used, for example, to identify an object within an image, recognize the overall scene and generate a verbal caption for it (e.g., [17], [18]). Also, for speech processing, MTL can be used to improve speech activity detection (e.g., [19], [20]).…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Some work uses advanced speech enhancement models, like denoising variational autoencoders [25], convolutional-recurrent-network-based speech enhancement, residual convolutional neural networks [26] and U-Net [27], to extract denoised features for VAD. In [28,29], the works optimize VAD and speech enhancement jointly in the framework of multitask learning. In [30], information about the speaker was exploited for the VAD, which makes VAD able to extract speaker-dependent speech segments.…”
Section: Introduction
Citation type: mentioning
confidence: 99%