Interspeech 2018
DOI: 10.21437/interspeech.2018-1526

Keyword Based Speaker Localization: Localizing a Target Speaker in a Multi-speaker Environment

To cite this version: Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr. Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment. Interspeech 2018 - 19th …

Abstract: Speaker localization is a hard task, especially in adverse environmental conditions involving reverberation and noise. In this work we introduce the new task of localizing the speaker who uttered a given keyword, e.g., the wake-up word of a distant-microphone voice command system, in the presence of overlappin…

Citations: cited by 21 publications (32 citation statements)
References: 24 publications
“…For each test case, we ran 200 simulations, and we report the results in two different metrics: the Gross Error Rate (GER, in %) and the Mean Absolute Error (MAE, in °). The GER measures the percentage of DoA estimations whose error is larger than a threshold of 5°, and the MAE measures the average estimation bias [24].…”
Section: Methods
mentioning
confidence: 99%
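
The two metrics described in the statement above can be sketched as follows. This is a minimal illustration, not the citing authors' code: the function names are hypothetical, and wrapping the angular error to [0°, 180°] is an assumption that the quoted text does not state. Only the 5° threshold comes from the quote.

```python
def angular_error(est_deg, ref_deg):
    """Absolute DoA error in degrees, wrapped to [0, 180] (wrap-around is an assumption)."""
    diff = abs(est_deg - ref_deg) % 360.0
    return min(diff, 360.0 - diff)


def gross_error_rate(est, ref, threshold_deg=5.0):
    """Percentage of estimates whose error exceeds the threshold (5 degrees in the quote)."""
    errors = [angular_error(e, r) for e, r in zip(est, ref)]
    return 100.0 * sum(err > threshold_deg for err in errors) / len(errors)


def mean_absolute_error(est, ref):
    """Average absolute DoA error in degrees."""
    errors = [angular_error(e, r) for e, r in zip(est, ref)]
    return sum(errors) / len(errors)


# Hypothetical estimates and ground-truth angles, in degrees.
estimated = [10, 50, 92, 359]
reference = [12, 40, 90, 1]
ger = gross_error_rate(estimated, reference)
mae = mean_absolute_error(estimated, reference)
```

With these made-up values, one of the four estimates misses by more than 5°, so the GER counts a single gross error while the MAE averages all four errors.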
“…When a competing speaker exists, multiple source localization techniques [23] could be applied; however, the localization results remain non-discriminative, and further post-processing for speaker identification is needed. TSL in a multi-speaker environment surely requires prior information of the target, such as a keyword uttered by the speaker [24]. Nevertheless, the point is that TSL could be addressed from a different perspective by first performing target speaker separation before localization.…”
mentioning
confidence: 99%
“…Step 1, Estimating the first DOA: In the first step we estimate the DOA of a first speaker using a neural network. The cosines and sines of the phase differences between all pairs of microphones [6,14], called cosine-sine interchannel phase difference (CSIPD) features, and the short-term magnitude spectrum of one of the channels (in the following, channel 1) are used as input features (see Section 3.2):…”
Section: Estimating the Sources
mentioning
confidence: 99%
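
The CSIPD features named in the statement above (cosines and sines of the phase differences between all microphone pairs) can be illustrated for a single time-frequency bin. This is a sketch under stated assumptions: the function name, the pair ordering, and the interleaved cos/sin layout are hypothetical choices, and the quoted work may arrange the features differently.

```python
import cmath
import math
from itertools import combinations


def csipd_features(stft_bin):
    """Cosine-sine interchannel phase differences for one time-frequency bin.

    stft_bin: list of complex STFT coefficients, one per microphone.
    Returns [cos(ipd), sin(ipd)] for every microphone pair (i, j) with i < j.
    The output ordering is an illustrative assumption.
    """
    feats = []
    for i, j in combinations(range(len(stft_bin)), 2):
        ipd = cmath.phase(stft_bin[i]) - cmath.phase(stft_bin[j])
        feats.extend([math.cos(ipd), math.sin(ipd)])
    return feats


# Hypothetical bin for a 3-microphone array: magnitudes differ, phases are 0.5, 0.2, -0.1 rad.
example_bin = [cmath.rect(1.0, 0.5), cmath.rect(2.0, 0.2), cmath.rect(0.5, -0.1)]
features = csipd_features(example_bin)
```

Three microphones give three pairs and therefore six values per bin. Using the cosine and sine rather than the raw phase difference avoids the discontinuity at ±π, and the features do not depend on the channel magnitudes.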
“…In this work we use a similar idea wherein we compute a remainder mask (1 − M_1) but instead of appending it to the network inputs we multiply the CSIPD and magnitude spectrum features with the remainder mask before feeding them as input to the following DOA estimation and mask estimation stages. Indeed, mask multiplication was shown to perform better than mask concatenation for speaker localization [14].…”
Section: Estimating the Sources
mentioning
confidence: 99%
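
The remainder-mask multiplication described in the statement above amounts to an element-wise product between the features and (1 − M_1). The sketch below assumes that features and mask share a [frame][frequency] layout; the function name and the toy values are hypothetical and not taken from the quoted work.

```python
def apply_remainder_mask(features, mask_1):
    """Multiply each feature value by the remainder mask (1 - M_1), element-wise.

    features, mask_1: nested lists indexed as [frame][frequency] (assumed layout).
    mask_1 holds the first speaker's estimated mask, with values in [0, 1].
    """
    return [
        [(1.0 - m) * f for f, m in zip(feat_row, mask_row)]
        for feat_row, mask_row in zip(features, mask_1)
    ]


# Toy magnitude features and first-speaker mask: 2 frames by 2 frequency bins.
magnitudes = [[1.0, 2.0], [3.0, 4.0]]
mask_1 = [[0.0, 1.0], [0.5, 0.25]]
masked = apply_remainder_mask(magnitudes, mask_1)
```

Bins that the first mask attributes fully to the first speaker are zeroed out, so the later stages mostly see the energy left for the remaining sources. The same function would be applied to the CSIPD features.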
“…The experiments exposed in this paper confirm the complementary role of standard home automation sensors and microphones and should orient future research towards probabilistic sequential models (e.g., CRF, HMM, RNN) and deep models which were difficult to start due to the shortage of data. Furthermore, the audio channel has been under-exploited in the pervasive community field while it can provide transient but accurate localization of a speaker [39]. This pleads for future research considering the whole information available at hand, which the VocADom@A4H corpus can support with its rich set of data.…”
Section: B Multi-sensor Multi-resident Localization
mentioning
confidence: 99%