Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

Medennikov, Ivan; Korenevsky, Maxim; Prisyach, Tatiana; Khokhlov, Yuri Y.; Korenevskaya, Mariya; Sorokin, Ivan; Timofeeva, T.A.; Mitrofanov, Anton; Andrusenko, Andrei; Podluzhny, Ivan; Laptev, Aleksandr; Romanenko, Aleksei

doi:10.21437/interspeech.2020-1602

Cited by 142 publications

(95 citation statements)

References 26 publications


“…Along the same line of thought, the recent studies on target-speaker voice activity detection (VAD) show that we are able to obtain the target speaker's boundary in a multi-talker speech, e.g. personal VAD [16], target VAD [17]. In general, the speaker diarization technique is helpful only if the speakers overlap sporadically, while it fails when the speakers are heavily overlapped in time.…”

Section: Introduction (mentioning)

confidence: 74%

Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech

Rao et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.


Speaker verification has been studied mostly under the single-talker condition. It is adversely affected in the presence of interfering speakers. Inspired by the study on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that is able to pay selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multitask learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four target speaker embedding schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora in terms of equal error rate for 2-talker speech, while the performance of tSV for single-talker speech is on par with that of a traditional speaker verification system trained and evaluated under the same single-talker condition.


“…Such clustering based diarization methods are effective only when one speaker is present in each segment, but cannot handle overlapping speech. In recent years, some hybrid methods combining clustering and discriminative methods addressed overlap [6,18] but they do not perform end-to-end diarization directly. TS-VAD [6] which uses speaker embeddings as conditioning inputs to a neural network was inspired from a speaker conditioned VAD approach [19].…”

Section: Related Work (mentioning)

confidence: 99%

“…Diarization is the task of predicting "who spoke when" given a recording of, e.g. a meeting or conversation [1,2,3], and is an important additional step for many speech applications like automatic speech recognition [4,5,6]. In this work, we focus on diarization for meeting audio, where there may be overlapping speech, from an unknown but bounded number of speakers.…”

Section: Introduction (mentioning)

confidence: 99%

End-To-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings

Maiti, Erdoğan, Wilson, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multitask transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data.


“…In the first experiment with simulated two-speaker mixtures, we used the oracle speaker activity with and without noise. In the second experiment with LibriCSS, we used target speaker VAD (TS-VAD) based diarization [16] to obtain the speech activity regions of each speaker.…”

Section: Auxiliary Information (mentioning)

confidence: 99%

Speaker Activity Driven Neural Speech Extraction

Delcroix, Žmolíková, Ochiai, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where the speaker activity is obtained from a diarization system. We show that this simple yet practical approach can successfully extract speakers after diarization, which results in improved ASR performance, especially in high overlapping conditions, with a relative word error rate reduction of up to 25%.
