Personal VAD: Speaker-Conditioned Voice Activity Detection

Moreno, Ignacio López; Wan, Li; Wang, Quan; Ding, Shaojin; Chang, Shuo-Yiin

doi:10.21437/odyssey.2020-62

Cited by 61 publications

(34 citation statements)

References 14 publications

“…In recent years, some hybrid methods combining clustering and discriminative methods addressed overlap [6,18] but they do not perform end-to-end diarization directly. TS-VAD [6] which uses speaker embeddings as conditioning inputs to a neural network was inspired from a speaker conditioned VAD approach [19].…”

Section: Related Work (mentioning)

confidence: 99%

End-To-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings

Maiti, Erdoğan, Wilson, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy-based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multitask transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on the LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data.
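The abstract above mentions permutation-invariant cross-entropy losses. As a rough illustration of the general idea only, the sketch below computes a fixed-speaker-count permutation-invariant binary cross-entropy in plain Python. It does not reproduce the paper's variable-number formulation; the function names and the toy data are assumptions made for this example.

```python
import itertools
import math


def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy of one predicted probability against a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def pit_bce_loss(probs, labels):
    """Permutation-invariant BCE for a fixed number of speakers (assumption).

    probs, labels: lists of frames; each frame is a list with one value
    per speaker. Returns (best mean loss, best permutation), where
    permutation[s] is the label column matched to output column s.
    """
    n_speakers = len(labels[0])
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(n_speakers)):
        total = sum(
            binary_cross_entropy(frame_p[s], frame_y[perm[s]])
            for frame_p, frame_y in zip(probs, labels)
            for s in range(n_speakers)
        )
        mean = total / (len(probs) * n_speakers)
        if best_loss is None or mean < best_loss:
            best_loss, best_perm = mean, perm
    return best_loss, best_perm


# Toy data: the model's output columns are swapped relative to the labels.
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
toy_labels = [[0, 1], [0, 1], [1, 0]]
toy_loss, toy_perm = pit_bce_loss(toy_probs, toy_labels)
```

Enumerating every permutation costs factorial time in the number of speakers, so this brute-force form is only practical for a handful of speakers.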

“…Lin et al. proposed a long short-term memory (LSTM)-based similarity measurement for clustering-based speaker diarization. Moreover, speech activity estimation based on neural networks has been proposed [28,29] that directly produces the speech activity from the acoustic features. Kinoshita et al. proposed an all-neural model that jointly solves speaker diarization, source separation, and source counting and demonstrated its performance on real meeting scenarios.…”

Section: Related Work (mentioning)

confidence: 99%

End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection

Takashima, Fujita, Watanabe, et al. 2021

2021 IEEE Spoken Language Technology Workshop (SLT)

In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. In this paper, to further improve the performance of the EEND system, we propose a novel multitask learning framework that solves speaker diarization and a desired subtask while explicitly considering the task dependency. We optimize speaker diarization conditioned on speech activity and overlap detection that are subtasks of speaker diarization, based on the probabilistic chain rule. Experimental results show that our proposed method can leverage a subtask to effectively model speaker diarization, and outperforms conventional EEND systems in terms of diarization error rate.
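The chain-rule conditioning described in the abstract above can be written out as a generic factorization. The symbols are assumptions introduced here for illustration: X for the acoustic features, S for speech activity labels, O for overlap labels, and Y for the diarization labels. The ordering of the subtasks is also an assumption; the abstract does not specify it.

```latex
p(Y, S, O \mid X) = p(S \mid X)\, p(O \mid S, X)\, p(Y \mid S, O, X)
```

Under this reading, the final factor is the diarization model conditioned on both subtask outputs.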

“…For example, target-speaker voice activity detection (TS-VAD) uses i-vectors to output the corresponding speakers' voice activities [21], but the number of speakers is fixed by the model architecture. Personal VAD [22] and VoiceFilter-Lite [23], which are based on d-vectors, have no such limitation, but they assume that each speaker's d-vector is stored in the database in advance; thus they are not suited for speaker-independent diarization.…”

Section: End-to-end Diarization For Overlapping Speech (mentioning)

confidence: 99%

End-To-End Speaker Diarization as Post-Processing

Horiguchi, García, Fujita, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper investigates the utilization of an end-to-end diarization model as post-processing of conventional clustering-based diarization. Clustering-based diarization methods partition frames into clusters of the number of speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker. On the other hand, some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification. Although some methods can treat a flexible number of speakers, they do not perform well when the number of speakers is large. To compensate for each other's weakness, we propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method. We iteratively select two speakers from the results and update the results of the two speakers to improve the overlapped region. Experimental results show that the proposed algorithm consistently improved the performance of the state-of-the-art methods across CALLHOME, AMI, and DIHARD II datasets.
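The iterative pairwise refinement described in the abstract above can be sketched as a simple control loop. This is only an outline of the idea, not the authors' algorithm: the pair-selection order, the `two_speaker_model` callable, and the toy gap-filling stand-in are all hypothetical. A real two-speaker model would operate on audio features rather than on the activity labels alone.

```python
from itertools import combinations


def refine_pairwise(activity, two_speaker_model):
    """Refine clustering output one speaker pair at a time.

    activity: dict mapping speaker id to a list of 0/1 frame labels.
    two_speaker_model: hypothetical callable taking two label lists and
    returning two updated label lists (it may mark overlapped frames).
    """
    result = {spk: list(frames) for spk, frames in activity.items()}
    for spk_a, spk_b in combinations(sorted(result), 2):
        result[spk_a], result[spk_b] = two_speaker_model(
            result[spk_a], result[spk_b]
        )
    return result


def toy_gap_filler(act_a, act_b):
    """Stand-in for a two-speaker model: fills one-frame gaps per speaker."""
    def fill(act):
        out = list(act)
        for t in range(1, len(act) - 1):
            if act[t - 1] and act[t + 1]:
                out[t] = 1
        return out
    return fill(act_a), fill(act_b)


# Clustering assigns each frame to exactly one speaker, so no overlap yet.
clustered = {
    "A": [1, 1, 0, 1, 0, 0],
    "B": [0, 0, 1, 0, 1, 1],
    "C": [0, 0, 0, 0, 0, 0],
}
refined = refine_pairwise(clustered, toy_gap_filler)
```

After refinement, speakers A and B are both marked active in the two middle frames, which a one-speaker-per-frame clustering result cannot express.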
