The Speaker and Language Recognition Workshop (Odyssey 2020)
DOI: 10.21437/odyssey.2020-62
Personal VAD: Speaker-Conditioned Voice Activity Detection

Abstract: In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For every frame, personal VAD outputs the scores for…
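As a rough illustration of the conditioning idea described in the abstract, the sketch below concatenates a fixed target-speaker embedding (e.g. a d-vector) with every acoustic frame before a small LSTM classifier that emits per-frame scores. The dimensions, layer sizes, and the three-class label set (non-speech, target-speaker speech, non-target-speaker speech) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a speaker-conditioned ("personal") VAD.
# Assumption: per-frame acoustic features are concatenated with a fixed
# speaker embedding and classified frame by frame; sizes are illustrative.
import torch
import torch.nn as nn

class PersonalVADSketch(nn.Module):
    def __init__(self, feat_dim=40, spk_dim=256, hidden=64, num_classes=3):
        super().__init__()
        # LSTM over [acoustic frame ; speaker embedding] at every time step
        self.lstm = nn.LSTM(feat_dim + spk_dim, hidden,
                            num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frames, spk_embedding):
        # frames: (batch, time, feat_dim); spk_embedding: (batch, spk_dim)
        cond = spk_embedding.unsqueeze(1).expand(-1, frames.size(1), -1)
        x = torch.cat([frames, cond], dim=-1)   # condition every frame
        h, _ = self.lstm(x)
        return self.classifier(h)               # per-frame class logits

# Usage: per-frame scores for a 100-frame utterance and one enrolled speaker.
model = PersonalVADSketch()
logits = model(torch.randn(1, 100, 40), torch.randn(1, 256))
print(logits.shape)  # torch.Size([1, 100, 3])
```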

Cited by 61 publications (34 citation statements)
References 14 publications
“…In recent years, some hybrid methods combining clustering and discriminative methods addressed overlap [6,18], but they do not perform end-to-end diarization directly. TS-VAD [6], which uses speaker embeddings as conditioning inputs to a neural network, was inspired by a speaker-conditioned VAD approach [19].…”
Section: Related Work
confidence: 99%
“…Lin et al. proposed a long short-term memory (LSTM)-based similarity measurement for clustering-based speaker diarization. Moreover, neural-network-based speech activity estimation has been proposed [28,29] that directly produces the speech activity from the acoustic features. Kinoshita et al. proposed an all-neural model that jointly solves speaker diarization, source separation, and source counting and demonstrated its performance on real meeting scenarios.…”
Section: Related Work
confidence: 99%
“…For example, target-speaker voice activity detection (TS-VAD) uses i-vectors to output the corresponding speakers' voice activities [21], but the number of speakers is fixed by the model architecture. Personal VAD [22] and VoiceFilter-Lite [23], which are based on d-vectors, do not have such a limitation, but they assume that each speaker's d-vector is stored in the database in advance; thus they are not suited for speaker-independent diarization.…”
Section: End-to-end Diarization For Overlapping Speech
confidence: 99%
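To make concrete both the pre-enrollment assumption noted in the excerpt above and the ASR-gating use case from the abstract, here is a minimal, hypothetical sketch: a small table of pre-enrolled d-vectors stands in for the enrollment database, and per-frame target-speaker scores (which in practice would come from a personal VAD conditioned on that embedding) decide which frames are forwarded to a streaming recognizer. The table, function names, and threshold are assumptions for illustration only.

```python
# Illustrative gating sketch (not from the paper): drop frames whose
# target-speaker score falls below a threshold before they reach the ASR.
import numpy as np

# Pre-enrolled d-vectors; personal VAD assumes these exist in advance.
ENROLLED_DVECTORS = {"user_01": np.random.randn(256)}

def gate_frames(frames, target_scores, threshold=0.5):
    """Keep only the frames attributed to the target speaker."""
    keep = target_scores >= threshold
    return frames[keep]

frames = np.random.randn(100, 40)     # 100 frames of acoustic features
# Placeholder scores; a real system would compute these with a personal VAD
# conditioned on ENROLLED_DVECTORS["user_01"].
target_scores = np.random.rand(100)
asr_input = gate_frames(frames, target_scores)
print(asr_input.shape)                # only target-speaker frames remain
```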