Interspeech 2022
DOI: 10.21437/interspeech.2022-10929

An Initialization Scheme for Meeting Separation with Spatial Mixture Models

Abstract: Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. Those act as masks for source extraction, either vi…

Cited by 7 publications (5 citation statements)
References 56 publications
“…For the following investigation, we use the recognition result for the Libri-CSS [7] dataset from a target-speaker separation (TS-SEP) [17] model followed by the base model from Whisper [18] as a single-speaker speech recognizer which delivers word boundaries we need for our analysis. This system generates diarization-style output.…”
Section: Choosing a Collar: Approaching The Desired WER (mentioning)
confidence: 99%
“…If a cluster with a smaller amount of activity intersects more than 50 % with a cluster with a larger amount of activity and more than one element of the TDOA vectors of both clusters match each other (see hyperboloid property of TDOA vectors described above), the cluster with the smaller amount of activity is discarded. After all, a dilation and an erosion filter are applied to the estimated activities to smooth the activity estimates [19].…”
Section: TDOA Clustering (mentioning)
confidence: 99%
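
The smoothing step mentioned in the statement above (a dilation followed by an erosion of the activity estimates) can be illustrated with a minimal sketch. The window widths, the function names, and the treatment of the sequence edges are assumptions made for illustration; the quoted text does not specify them, and this is not the cited authors' implementation.

```python
def dilate(activity, width):
    """Mark a frame active if any frame in the centered window is active."""
    half = width // 2
    n = len(activity)
    return [any(activity[max(0, t - half):min(n, t + half + 1)]) for t in range(n)]


def erode(activity, width):
    """Keep a frame active only if every frame in the centered window is active."""
    half = width // 2
    n = len(activity)
    return [all(activity[max(0, t - half):min(n, t + half + 1)]) for t in range(n)]


def smooth_activity(activity, dilation_width=3, erosion_width=3):
    """Dilation followed by erosion: fills short gaps inside active regions.

    The default widths are hypothetical values chosen for the example.
    """
    return erode(dilate(activity, dilation_width), erosion_width)


# Frame-level activity of one speaker with a one-frame gap at index 4.
frames = [0, 0, 1, 1, 0, 1, 1, 0, 0, 0]
smoothed = smooth_activity(frames)
```

With equal widths, the dilation closes gaps shorter than the window and the erosion then restores the original outer boundaries of the active region.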
“…We utilize an MVDR beamformer in the formulation of [25], [26] to extract the signals of the single speakers. Therefore, we first re-segment the segments used for GSS based on the target speakers' activities, which are calculated from the estimated prior probabilities of the spatial mixture model as described in [19]. The beamforming coefficients are calculated for each resulting segment, defined by continuous activity of the target speaker, whose signal should be extracted.…”
Section: B Beamforming (mentioning)
confidence: 99%
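
The re-segmentation described in the statement above, where each segment is defined by continuous activity of the target speaker, can be sketched as follows. This is only an illustration under assumptions: the activity is taken to be an already thresholded frame-level sequence, and the function name `active_segments` is hypothetical. How the activities are derived from the prior probabilities of the spatial mixture model is not covered here.

```python
def active_segments(activity):
    """Return (start, end) frame indices of each run of continuous activity.

    `end` is exclusive, so a segment covers frames start, ..., end - 1.
    """
    segments = []
    start = None
    for t, is_active in enumerate(activity):
        if is_active and start is None:
            start = t  # a new active run begins
        elif not is_active and start is not None:
            segments.append((start, t))  # the run ended at frame t - 1
            start = None
    if start is not None:
        segments.append((start, len(activity)))  # run reaches the last frame
    return segments


# Two active runs for a hypothetical target speaker.
segments = active_segments([0, 1, 1, 0, 0, 1, 1, 1])
```

Beamforming coefficients would then be computed once per returned segment.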
“…Many studies have been proposed to improve different aspects of the CSS framework [3,4,5,6,7,8,9,10]. We introduced a modulation factor based on segment overlap ratio to dynamically adjust the separation loss [3].…”
Section: Introduction (mentioning)
confidence: 99%
“…Saijo et al proposed a spatial loss that uses the estimated direction of arrival to impose spatial constraints on the demixing matrix [9]. Boeddeker et al [10] introduced an initialization scheme for a beamformer based on a complex Angular Central Gaussian Mixture model.…”
Section: Introduction (mentioning)
confidence: 99%