Common sound source localization algorithms focus on localizing all active sources in the environment. Since the source identities are generally unknown, retrieving the location of a speaker of interest requires extra effort. This paper addresses the problem of localizing a speaker of interest from a novel perspective, by performing time-frequency selection before localization. The speaker of interest, namely the target speaker, is assumed to be sparsely active in the signal spectra. The target-speaker-dominant time-frequency regions are separated by a speaker-aware Long Short-Term Memory (LSTM) neural network, and they are sufficient to determine the Direction of Arrival (DoA) of the target speaker. Speaker awareness is achieved by utilizing a short target utterance to adapt the hidden-layer outputs of the neural network. The instantaneous DoA estimator is based on the probabilistic complex Watson Mixture Model (cWMM), and a weighted maximum likelihood estimation of the model parameters is derived accordingly. Simulation experiments show that the proposed algorithm works well in various noisy conditions and remains robust when the signal-to-noise ratio is low and when a competing speaker exists.

components, especially the direct sound components, in the observed signals [9]. These techniques include those based on the signal power [10], the coherent-to-diffuse ratio [11], and the speech presence probability [12]. The motivation behind time-frequency weighting is the sparsity assumption on the signal spectra, or in other words, the assumption that only one source is active in each time-frequency bin [13,14]. As suggested in [15], even if the signal is severely corrupted by noise or interference, there exist target-dominant time-frequency regions that are sufficient for localization. It has also been found that the human auditory system may perform source separation jointly with source localization [16].
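To illustrate the time-frequency weighting idea described above, the following minimal sketch weights the narrowband terms of an SRP-PHAT delay estimator by a time-frequency mask, so that only (presumed) target-dominant bins contribute to the spatial score. This is an illustrative assumption-laden example, not the paper's cWMM-based estimator: the function name, the toy two-channel setup with an integer-sample delay, and the all-ones mask are hypothetical choices for demonstration.

```python
import numpy as np

def masked_srp_phat(x1, x2, mask, fs, max_delay, nfft=512, hop=256):
    """TDOA estimate from two channels: SRP-PHAT with mask-weighted narrowband terms.

    mask[t, k] weights time-frequency bin (t, k); under the sparsity
    assumption, target-dominant bins carry reliable phase information.
    """
    win = np.hanning(nfft)
    frames = (len(x1) - nfft) // hop + 1
    f = np.fft.rfftfreq(nfft, 1 / fs)                  # bin frequencies (Hz)
    taus = np.arange(-max_delay, max_delay + 1) / fs   # candidate delays (s)
    srp = np.zeros(len(taus))
    for t in range(frames):
        s = slice(t * hop, t * hop + nfft)
        X1 = np.fft.rfft(win * x1[s])
        X2 = np.fft.rfft(win * x2[s])
        cross = X1 * np.conj(X2)
        phat = cross / (np.abs(cross) + 1e-12)         # PHAT whitening
        for i, tau in enumerate(taus):
            # accumulate only mask-weighted narrowband coherence
            srp[i] += np.sum(mask[t] * np.real(phat * np.exp(2j * np.pi * f * tau)))
    return taus[np.argmax(srp)]

# Toy check: white noise, channel 2 advanced by 3 samples relative to channel 1,
# with an all-ones mask (i.e., no time-frequency selection).
rng = np.random.default_rng(0)
fs, d = 16000, 3
sig = rng.standard_normal(fs)
x1, x2 = sig[:-d], sig[d:]
frames = (len(x1) - 512) // 256 + 1
mask = np.ones((frames, 257))                          # 257 = nfft // 2 + 1 bins
tau = masked_srp_phat(x1, x2, mask, fs, max_delay=8)
```

In a masking-based localizer, the all-ones mask would be replaced by a network-estimated mask, which suppresses noise- or interferer-dominated bins before the narrowband estimates are combined.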
The same motivation lies behind some fairly recent studies that perform robust source localization based on time-frequency masking [17][18][19], in which Deep Neural Networks (DNNs) are first used to estimate the masks. The masks are then used to weight the narrowband estimates, and the combined algorithms achieve superior performance in adverse environments. Given that mask estimation is a well-defined task in monaural speech separation and that substantial progress has been made with DNNs [20], the combination of masking with the SRP-PHAT (or MUSIC) algorithm is natural and straightforward. Neural networks have also been used to localize speech sources directly through spatial classification [21,22]. The methods discussed above are directly applicable to TSL if the target speaker is the only speech source in the environment. When a competing speaker exists, multiple-source localization techniques [23] could be applied; however, the localization results remain non-discriminative, and further post-processing for speaker identification is needed. TSL in a multi-speaker environ...