ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053327

WHAMR!: Noisy and Reverberant Single-Channel Speech Separation

Abstract: While significant advances have been made in recent years in the separation of overlapping speech signals, studies have been largely constrained to mixtures of clean, near-field speech, not representative of many real-world scenarios. Although the WHAM! dataset introduced noise to the ubiquitous wsj0-2mix dataset, it did not include the addition of reverberation, generally present in indoor recordings outside of recording studios. The spectral smearing caused by reverberation can result in significant performa…

Citations: cited by 123 publications (93 citation statements)
References: 24 publications
“…The evaluation is performed on the WHAMR! dataset [21], which consists of simulated noisy and reverberant 2-speaker mixtures. WHAMR!…”
Section: Data Simulation (mentioning)
confidence: 99%
“…pMnet takes the log magnitude spectrum of the beamformer output signals and generates real-valued time-frequency masks corresponding to each source. Motivated by the results in [1,4] where a masking-based network with recurrent layers was used for speech separation, we consider a recurrent pMnet consisting of four BLSTM layers followed by one fully connected layer to estimate the stacked masks corresponding to the sources. As an alternative to the recurrent pMnet, we also consider a convolutional-recurrent pMnet with an encoder-decoder network architecture (see Fig.…”
Section: Dbnet Extensions With Post Masking (mentioning)
confidence: 99%
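The post-masking step described in the statement above can be illustrated with a minimal sketch. It covers only the bookkeeping around the network: computing a log magnitude spectrum, splitting one frame of stacked network output into per-source masks, and applying each mask. The function names, the single-frame layout, and the choice to apply mask k to beamformer output k are assumptions made for illustration; the quoted passage does not specify them, and the BLSTM network itself is not modeled here.

```python
import math


def log_magnitude(frame, eps=1e-8):
    """Log magnitude of one complex STFT frame (eps avoids log(0))."""
    return [math.log(abs(x) + eps) for x in frame]


def split_stacked_masks(stacked, num_sources):
    """Split one frame of stacked mask values into one mask per source.

    Assumes the masks are concatenated source by source, so the vector
    length must be a multiple of num_sources.
    """
    if num_sources <= 0 or len(stacked) % num_sources != 0:
        raise ValueError("stacked length must be a multiple of num_sources")
    n_freq = len(stacked) // num_sources
    return [stacked[k * n_freq:(k + 1) * n_freq] for k in range(num_sources)]


def apply_masks(frames, masks):
    """Multiply each source's beamformer output frame by its real-valued mask.

    Assumption: mask k is applied to beamformer output k.
    """
    if len(frames) != len(masks):
        raise ValueError("need one mask per beamformer output")
    return [[m * x for m, x in zip(mask, frame)]
            for mask, frame in zip(masks, frames)]
```

Under these assumptions, a network output of length F·K for a frame with F frequency bins and K sources yields K masks of length F, each scaling the corresponding frame bin by bin.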
“…Each sequence was a random 10 s sub-portion from the 30 s signals. To improve the network generalization, we use gradient clipping technique with a maximum L2 norm of 5, similarly as used in [4].…”
Section: Network Training and Stft Setup (mentioning)
confidence: 99%
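The two training details in the statement above, random sub-portions of longer signals and gradient clipping with a maximum L2 norm of 5, can be sketched in plain Python. This is an illustration under stated assumptions, not the cited authors' code: the clipping is assumed to act on the joint L2 norm over all gradients, and the function names and list-based data layout are hypothetical.

```python
import math
import random


def random_crop(signal, crop_len, rng=random):
    """Return a random contiguous sub-portion of crop_len samples.

    If the signal is not longer than crop_len, it is returned unchanged.
    """
    if crop_len >= len(signal):
        return list(signal)
    start = rng.randrange(len(signal) - crop_len + 1)
    return list(signal[start:start + crop_len])


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradients so their joint L2 norm does not exceed max_norm.

    grads is a list of gradient vectors (one per parameter group).
    Returns the (possibly rescaled) gradients and the norm before clipping.
    Assumption: the norm is taken over all gradients jointly.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return [list(vec) for vec in grads], total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total
```

With a sample rate `fs` (not stated in the quoted passage), a 10 s crop from a 30 s signal would correspond to `random_crop(signal, 10 * fs)`. Clipping rescales all gradients by the same factor, so the update direction is preserved and only its magnitude is bounded.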
“…Without access to the clean ground truth $s_k^{\mathrm{clean}}(t)$, the network is instead trained with the noisy speech signals $s_k^{\mathrm{noisy}}(t)$ as target. Though arguably more desirable for the separation network to produce $s_k^{\mathrm{clean}}(t)$ than $s_k^{\mathrm{noisy}}(t)$, it is not an inherently incorrect separation solution as long as the speech signals themselves have been separated, particularly when taking into account the possibility to then feed the noisy separated output into a de-noising speech enhancement network, something shown to be successful with synthetic noisy mixtures [8]. However, this paradigm nevertheless likely has issues stemming from inseparability of the noise mixtures.…”
Section: Noisy Separation Formulations (mentioning)
confidence: 99%