Broadband doa estimation using convolutional neural networks trained with noise signals

Chakrabarty, Soumitro; Habets, Emanuël A. P.

doi:10.1109/waspaa.2017.8170010

Cited by 228 publications

(211 citation statements)

References 11 publications

Supporting

Mentioning

198

Contrasting

Order By: Relevance

“…We refer to this spectrum as the reverberated spectrum identifier. The reverberated speech t R c is corrupted by late reverberation which is known to have a detrimental effect on single-speaker DOA estimation [17,22]. Therefore, we also consider the spectrum |T E | of the signal containing only the direct component and the early reflections of the target signal as an identifier and call it the early spectrum identifier.…”

Section: Spectrum-based Identifiersmentioning

confidence: 99%

“…To improve the robustness of DOA estimation, deep neural networks (DNNs) have been proposed to learn a mapping between signal features and a discretized DOA space [17][18][19][20][21]. Various features such as phasemaps [17,18] and GCC-PHAT [21] have been used as inputs.…”

Section: Introductionmentioning

confidence: 99%

“…Various features such as phasemaps [17,18] and GCC-PHAT [21] have been used as inputs. In [22], the cosines and sines of the frequency-wise phase differences between microphones, termed as cosine-sine interchannel phase difference (CSIPD) features, have been shown to perform as well as phasemaps for DOA estimation, despite their lower dimensionality.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Keyword Based Speaker Localization: Localizing a Target Speaker in a Multi-speaker Environment

Sivasankaran¹,

Fohr²

2018

Interspeech 2018

View full text Add to dashboard Cite

To cite this version:Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr. Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment. Interspeech 2018 -19th AbstractSpeaker localization is a hard task, especially in adverse environmental conditions involving reverberation and noise. In this work we introduce the new task of localizing the speaker who uttered a given keyword, e.g., the wake-up word of a distantmicrophone voice command system, in the presence of overlapping speech. We employ a convolutional neural network based localization system and investigate multiple identifiers as additional inputs to the system in order to characterize this speaker.We conduct experiments using ground truth identifiers which are obtained assuming the availability of clean speech and also in realistic conditions where the identifiers are computed from the corrupted speech. We find that the identifier consisting of the ground truth time-frequency mask corresponding to the target speaker provides the best localization performance and we propose methods to estimate such a mask in adverse reverberant and noisy conditions using the considered keyword.

show abstract

Section: Spectrum-based Identifiersmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Keyword Based Speaker Localization: Localizing a Target Speaker in a Multi-speaker Environment

Sivasankaran¹,

Fohr²

2018

Interspeech 2018

View full text Add to dashboard Cite

show abstract

“…In this paper, we want to exploit the capabilities of both QNNs and Ambisonics to analyze 3D sounds, and in particular we focus on the localization and detection of 3D sound events. Both tasks have been widely investigated recently by using convolutional neural networks (CNNs) [19][20][21][22][23][24][25]. They are also considered as a joint task in [26] for 3D sounds, but considering each microphone signal as a separate real-valued signal.…”

Section: Introductionmentioning

confidence: 99%

Quaternion Convolutional Neural Networks for Detection and Localization of 3D Sound Events

Comminiello

Lella

Scardapane

et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Learning from data in the quaternion domain enables us to exploit internal dependencies of 4D signals and treating them as a single entity. One of the models that perfectly suits with quaternion-valued data processing is represented by 3D acoustic signals in their spherical harmonics decomposition. In this paper, we address the problem of localizing and detecting sound events in the spatial sound field by using quaternion-valued data processing. In particular, we consider the spherical harmonic components of the signals captured by a first-order ambisonic microphone and process them by using a quaternion convolutional neural network. Experimental results show that the proposed approach exploits the correlated nature of the ambisonic signals, thus improving accuracy results in 3D sound event detection and localization.

show abstract

“…Other neural network structures, such as CNNs, are neural networks designed to process data in the form of multiple arrays (such as images with three colours channels) and contain convolutional and pooling layers [5]. CNNs have been used to estimate the DOA for speech separation in [12] and trained using synthesized noise signals, but recorded with a four-microphones array. In this paper, we present a system that is able to perform source localization and source separation.…”

Section: Introductionmentioning

confidence: 99%

Improving Reverberant Speech Separation with Binaural Cues Using Temporal Context and Convolutional Neural Networks

Zermini

Kong

et al. 2018

Latent Variable Analysis and Signal Separation

View full text Add to dashboard Cite

Abstract. Given binaural features as input, such as interaural level difference and interaural phase difference, Deep Neural Networks (DNNs) have been recently used to localize sound sources in a mixture of speech signals and/or noise, and to create time-frequency masks for the estimation of the sound sources in reverberant rooms. Here, we explore a more advanced system, where feed-forward DNNs are replaced by Convolutional Neural Networks (CNNs). In addition, the adjacent frames of each time frame (occurring before and after this frame) are used to exploit contextual information, thus improving the localization and separation for each source. The quality of the separation results is evaluated in terms of Signal to Distortion Ratio (SDR).

show abstract

Broadband doa estimation using convolutional neural networks trained with noise signals

Cited by 228 publications

References 11 publications

Keyword Based Speaker Localization: Localizing a Target Speaker in a Multi-speaker Environment

Keyword Based Speaker Localization: Localizing a Target Speaker in a Multi-speaker Environment

Quaternion Convolutional Neural Networks for Detection and Localization of 3D Sound Events

Improving Reverberant Speech Separation with Binaural Cues Using Temporal Context and Convolutional Neural Networks

Contact Info

Product

Resources

About