Interspeech 2019
DOI: 10.21437/interspeech.2019-1101
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals…
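The abstract's two-network design can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch rendering of the masking stage only, assuming magnitude spectrograms and a fixed-dimensional d-vector; the layer sizes, names, and the simple concatenation-based conditioning are illustrative assumptions, not the paper's exact architecture:

import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Toy speaker-conditioned masking network (illustrative sizes)."""
    def __init__(self, n_freq=601, emb_dim=256, hidden=400):
        super().__init__()
        self.lstm = nn.LSTM(n_freq + emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag, d_vector):
        # noisy_mag: (batch, frames, n_freq); d_vector: (batch, emb_dim)
        emb = d_vector.unsqueeze(1).expand(-1, noisy_mag.size(1), -1)
        x = torch.cat([noisy_mag, emb], dim=-1)    # condition every frame
        h, _ = self.lstm(x)
        mask = torch.sigmoid(self.fc(h))           # soft mask in [0, 1]
        return mask * noisy_mag                    # enhanced magnitude

Concatenating the embedding to every time frame is one common conditioning choice; the speaker encoder that produces d_vector is trained separately and kept fixed, consistent with the two-network description above.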

Cited by 267 publications (243 citation statements: 3 supporting, 240 mentioning, 0 contrasting), published 2019–2023.
References 20 publications.
“…Another approach to the permutation indeterminacy problem is "informed extraction", which makes use of additional information to distinguish a target speaker from the other participating speakers. The use of visual information [9,10], audio snippets of the speakers [11,12,13], or their locations [14,15,16] has been investigated. On top of, or aside from, these algorithmic improvements, researchers have also sought more effective input features [17,18] and model architectures [19,20,21].…”
Section: Introduction (mentioning)
confidence: 99%
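To make the "audio snippets" variant in the statement above concrete: a common recipe, sketched here with a hypothetical pretrained encoder function, is to embed a few reference clips of the target speaker and average them into a single L2-normalized speaker vector:

import numpy as np

def speaker_embedding(encoder, reference_clips):
    """Average per-clip embeddings from a (hypothetical) pretrained encoder."""
    vecs = np.stack([encoder(clip) for clip in reference_clips])
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)   # L2-normalize the speaker vector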
“…A related work, VoiceFilter [11] (VF), targeting speaker-dependent speech separation, was published recently. VF uses a target speaker embedding, extracted from a pretrained and fixed speaker recognition network, to separate the target voice from among multiple speakers.…”
Section: MSE Loss (mentioning)
confidence: 99%
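The MSE objective named in this section heading can be sketched as a single training step, reusing the hypothetical MaskNet from the earlier snippet; only the masking network is updated, since the statement above notes that the speaker recognition network is pretrained and fixed:

import torch
import torch.nn.functional as F

def mse_step(mask_net, optimizer, noisy_mag, clean_mag, d_vector):
    """One update of the masking network; the speaker encoder stays frozen."""
    optimizer.zero_grad()
    enhanced = mask_net(noisy_mag, d_vector)   # masked spectrogram
    loss = F.mse_loss(enhanced, clean_mag)     # spectrogram-domain MSE
    loss.backward()
    optimizer.step()
    return loss.item()

# usage (shapes are illustrative):
# net = MaskNet(); opt = torch.optim.Adam(net.parameters(), lr=1e-3)
# loss = mse_step(net, opt, noisy, clean, dvec)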
“…For a reliable comparison between the conventional VF and our proposed method, we use the VCTK dataset [16], as indicated in [11], to train and evaluate our network. Moreover, 99 and 10 speakers are randomly selected for training and validation, respectively, following the data generation workflow described in [11]. Parameter settings are mostly the same as for the conventional VF, except that the d-vector dimension is set to 512.…”
Section: Data Generation and Settings (mentioning)
confidence: 99%
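A minimal sketch of such a two-speaker mixing workflow (an assumption about the general shape of the cited procedure, not a reproduction of it): sum a target and an interfering utterance at a chosen SNR and keep the clean target as the training label.

import numpy as np

def make_mixture(target_wav, interferer_wav, snr_db=0.0):
    """Sum a target and an interfering utterance at a chosen SNR."""
    n = min(len(target_wav), len(interferer_wav))
    t, i = target_wav[:n], interferer_wav[:n]
    # scale the interferer so the pair mixes at snr_db
    gain = np.sqrt(np.sum(t**2) / (np.sum(i**2) * 10 ** (snr_db / 10) + 1e-12))
    mixture = t + gain * i
    peak = np.abs(mixture).max()
    if peak > 1.0:                     # normalize to avoid clipping
        mixture, t = mixture / peak, t / peak
    return mixture, t                  # (model input, clean training target)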
“…A fundamental problem in machine hearing is deconstructing an acoustic mixture into its constituent sounds. This has been done quite successfully for certain classes of sounds, such as separating speech from nonspeech interference [1,2] or speech from speech [3,4,5,6,7]. However, the more general problem of separating arbitrary classes of sound, known as "universal sound separation", has only recently been addressed [8].…”
Section: Introduction (mentioning)
confidence: 99%
“…In [16], a conditional variational autoencoder is proposed to generate prosodic features for speech synthesis by sampling prosodic embeddings from the bottleneck representation. Specifically for separation tasks, speaker-discriminative embeddings are produced for targeted voice separation in [6] and for diarization in [17], yielding a significant improvement over the unconditional separation framework. Recent works [18,19] have utilized conditional embeddings for each music class in order to boost the performance of a deep attractor network [20] for music separation.…”
Section: Introduction (mentioning)
confidence: 99%