Interspeech 2020
DOI: 10.21437/interspeech.2020-1602

Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

Cited by 142 publications (95 citation statements)
References 26 publications
“…Along the same line of thought, the recent studies on target-speaker voice activity detection (VAD) show that we are able to obtain the target speaker's boundary in a multi-talker speech, e.g. personal VAD [16], target VAD [17]. In general, the speaker diarization technique is helpful only if the speakers overlap sporadically, while it fails when the speakers are heavily overlapped in time.…”
Section: Introduction (mentioning)
confidence: 74%
“…Such clustering based diarization methods are effective only when one speaker is present in each segment, but cannot handle overlapping speech. In recent years, some hybrid methods combining clustering and discriminative methods addressed overlap [6,18] but they do not perform end-to-end diarization directly. TS-VAD [6] which uses speaker embeddings as conditioning inputs to a neural network was inspired from a speaker conditioned VAD approach [19].…”
Section: Related Work (mentioning)
confidence: 99%
“…Diarization is the task of predicting "who spoke when" given a recording of, e.g. a meeting or conversation [1,2,3], and is an important additional step for many speech applications like automatic speech recognition [4,5,6]. In this work, we focus on diarization for meeting audio, where there may be overlapping speech, from an unknown but bounded number of speakers.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the first experiment with simulated two-speaker mixtures, we used the oracle speaker activity with and without noise. In the second experiment with LibriCSS, we used target speaker VAD (TS-VAD) based diarization [16] to obtain the speech activity regions of each speaker.…”
Section: Auxiliary Information (mentioning)
confidence: 99%