A Shifted Delta Coefficient Objective for Monaural Speech Separation Using Multi-task Learning

Xu, Chenglin; Rao, Wei; Chng, Eng Siong; Li, Haizhou

doi:10.21437/interspeech.2018-1150

Cited by 17 publications

(16 citation statements)

References 22 publications


“…In order to better compare the performance of our proposed method (uPIT+DEF+DL) and other separation methods, Table 2 presents the results of SDR (dB) in the other competitive approaches on the same WSJ0-2mix dataset. Note that, for [9,12,25,15,26,13] methods are use SDR improvements results. Therefore, we manually add 0.2 dB to their final results although the SDR result of the mixture is only about 0.15 dB.…”

Section: Comparisons With Other Separation Methods (mentioning)

confidence: 99%
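
The adjustment described in the excerpt follows from the usual definition of SDR improvement (SDRi) as the SDR of the separated signal minus the SDR of the unprocessed mixture. A rough worked form using the figures quoted above is given here; the symbol names are illustrative and are not taken from the citing paper:

```latex
\mathrm{SDR}_{\text{sep}} = \mathrm{SDRi} + \mathrm{SDR}_{\text{mix}},
\qquad
\mathrm{SDR}_{\text{mix}} \approx 0.15\ \text{dB}
\;\Rightarrow\;
\mathrm{SDRi} + 0.2\ \text{dB} \approx \mathrm{SDR}_{\text{sep}} + 0.05\ \text{dB}
```

Under this reading, the 0.2 dB offset overstates the competing systems' SDR by roughly 0.05 dB.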

Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features

Fan, Liu, Tao, et al. 2019

Interspeech 2019


Deep clustering (DC) and utterance-level permutation invariant training (uPIT) have been demonstrated promising for speaker-independent speech separation. DC is usually formulated as two-step processes: embedding learning and embedding clustering, which results in complex separation pipelines and a huge obstacle in directly optimizing the actual separation objectives. As for uPIT, it only minimizes the chosen permutation with the lowest mean square error, doesn't discriminate it with other permutations. In this paper, we propose a discriminative learning method for speaker-independent speech separation using deep embedding features. Firstly, a DC network is trained to extract deep embedding features, which contain each source's information and have an advantage in discriminating each target speakers. Then these features are used as the input for uPIT to directly separate the different sources. Finally, uPIT and DC are jointly trained, which directly optimizes the actual separation objectives. Moreover, in order to maximize the distance of each permutation, the discriminative learning is applied to fine tuning the whole model. Our experiments are conducted on WSJ0-2mix dataset. Experimental results show that the proposed models achieve better performances than DC and uPIT for speaker-independent speech separation.
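
The abstract's description of uPIT (selecting the output-to-speaker permutation with the lowest mean square error over the whole utterance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mse` and `upit_loss` are hypothetical, and each source is assumed to be a flat list of feature values for one utterance rather than a real spectrogram.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def upit_loss(estimates, targets):
    """Utterance-level permutation-invariant loss (illustrative sketch).

    estimates, targets: lists of sources; each source is a flat list of
    feature values covering the whole utterance (an assumption made here
    for simplicity).
    Returns (lowest average MSE, permutation that achieves it), where
    permutation[i] is the target index assigned to estimate i.
    """
    n = len(targets)
    best_loss, best_perm = None, None
    # Try every assignment of estimated outputs to reference speakers.
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], targets[p]) for i, p in enumerate(perm)) / n
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the network's outputs come out in swapped speaker order.
estimates = [[0.9, 0.1, 0.8], [0.1, 1.0, 0.2]]
targets = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
loss, perm = upit_loss(estimates, targets)
print(perm)  # (1, 0): estimate 0 matches target 1, estimate 1 matches target 0
```

Only the winning permutation contributes to the loss in this sketch, which is the limitation the abstract points to: the remaining permutations are never explicitly pushed away.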



“…To recover a single-talker speech sample, the monaural speech separation techniques could come in handy. Successful implementations include deep clustering [18], deep attractor network [19], permutation invariant training [20]-[22], Conv-TasNet [23], DPRNN [24]. However, speech separation technique seeks to recover the single-talker speech for each individual, that is not only an overkill for speaker verification, but also difficult particularly when we don't know the number of speakers in the multi-talker speech.…”

Section: Introduction (mentioning)

confidence: 99%

Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech

Rao et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite


Speaker verification has been studied mostly under the single-talker condition. It is adversely affected in the presence of interference speakers. Inspired by the study on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech, that is able to pay selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multitask learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four target speaker embedding schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora in terms of equal-error-rate for 2-talker speech, while the performance of tSV for single-talker speech is on par with that of traditional speaker verification system, that is trained and evaluated under the same single-talker condition.


“…Recent deep learning based methods, such as Deep Clustering (DC) [3][4][5], Deep Attractor Network (DANet) [6], Permutation Invariant Training (PIT) methods [7][8][9][10], have significantly advanced the performance of multi-talker speech separation. However, the number of speakers has to be known [footnote: Wei Rao contributed to this work before joining National University of Singapore.]…”

Section: Introduction (mentioning)

confidence: 99%

Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss

Rao, Chng, et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It attempts to overcome the problem of unknown number of speakers in an audio recording during source separation. The mask approximation loss of SBF is sub-optimal, which doesn't calculate direct signal reconstruction error and consider the speech context. To address these problems, this paper proposes a magnitude and temporal spectrum approximation loss to estimate a phase sensitive mask for the target speaker with the speaker characteristics. Moreover, this paper explores a concatenation framework instead of the context adaptive deep neural network in the SBF method to encode a speaker embedding into the mask estimation network. Experimental results under open evaluation condition show that the proposed method achieves 70.4% and 17.7% relative improvement over the SBF baseline on signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ), respectively. A further analysis demonstrates 69.1% and 72.3% relative SDR improvements obtained by the proposed method for different and same gender mixtures.
