2021
DOI: 10.48550/arxiv.2106.03903
Preprint
PILOT: Introducing Transformers for Probabilistic Sound Event Localization

Abstract: Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-cha…

Cited by 2 publications (4 citation statements) | References 27 publications
“…For example, the authors of [186] placed an 8-head attention layer after a series of convolutional layers to track the source location predictions over time for different sources (up to two sources in their experiments). In [187], Schymura et al. used three 4-head self-attention encoders along the time axis after a series of convolutional layers before estimating the activity and location of several sound events (see Fig. 6).…”
Section: F. Attention-based Neural Networks
confidence: 99%
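The conv-then-self-attention pattern described in this statement can be illustrated with a minimal numpy sketch of one multi-head self-attention layer applied along the time axis of a frame-wise feature sequence. The random projection matrices stand in for learned weights, and all sizes are illustrative assumptions, not the configurations used in [186] or [187].

```python
import numpy as np

def multi_head_self_attention(x, n_heads=4, seed=0):
    """One multi-head self-attention layer over a sequence x of shape
    (time, d_model). Projections are random stand-ins for learned weights."""
    t, d = x.shape
    assert d % n_heads == 0, "d_model must divide evenly across heads"
    d_head = d // n_heads
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(n_heads):
        # hypothetical learned query/key/value projections, drawn at random
        Wq = rng.standard_normal((d, d_head)) / np.sqrt(d)
        Wk = rng.standard_normal((d, d_head)) / np.sqrt(d)
        Wv = rng.standard_normal((d, d_head)) / np.sqrt(d)
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = (q @ k.T) / np.sqrt(d_head)            # (time, time)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)               # softmax over time
        heads.append(w @ v)                             # (time, d_head)
    # concatenating the heads restores the model dimension
    return np.concatenate(heads, axis=1)                # (time, d_model)
```

In the cited systems this layer sits after a convolutional front-end, so each time step of `x` would be a learned embedding of one spectrogram frame rather than raw features.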
“…Several systems consider only the magnitude spectrograms, such as [52], [140], [199], [204], while others consider only the phase spectrogram [128], [203]. When both magnitude and phase are considered, they can also be stacked along a third dimension (i.e., as channels). This representation has been employed in many neural-based SSL systems [41], [70], [131], [143], [147], [148], [152], [153], [187]. Other systems proposed to decompose the complex-valued spectrograms into real and imaginary parts [42], [119], [192], [205].…”
Section: Spectrogram-based Features
confidence: 99%
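The two stacking strategies described in this statement (magnitude plus phase as extra channels, or a real/imaginary decomposition) can be sketched as follows. The function name, `mode` parameter, and array shapes are illustrative assumptions, not taken from any of the cited systems.

```python
import numpy as np

def spectrogram_features(stft, mode="mag_phase"):
    """Stack real-valued features from a complex multi-channel STFT along
    the channel axis. stft: complex array (mics, time, freq).
    Returns a real array of shape (2 * mics, time, freq)."""
    if mode == "mag_phase":
        # magnitude and phase spectrograms stacked as extra channels
        return np.concatenate([np.abs(stft), np.angle(stft)], axis=0)
    if mode == "real_imag":
        # decomposition into real and imaginary parts
        return np.concatenate([stft.real, stft.imag], axis=0)
    raise ValueError(f"unknown mode: {mode}")
```

Both variants double the channel count, which is why networks consuming them typically use 2D convolutions treating (time, freq) as the image plane.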
“…Various input features were proposed to be used with ANN-based sound source localization methods, such as interaural level, phase or time difference [16], [18], [51], [52], phase transform-based features [38], [53]- [56], magnitude and phase spectrograms of array signals [35], [41], [57]- [59] and even unprocessed audio waveforms [20], [60]- [66].…”
Section: Introduction
confidence: 99%
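Among the input features listed in this statement, the phase transform-based features are typically derived from the GCC-PHAT cross-correlation, a standard construction for estimating the time difference of arrival between two microphones. A minimal numpy sketch (parameter names are illustrative):

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """Estimate the time difference of arrival of sig relative to ref using
    the phase transform (PHAT) weighted cross-correlation."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # reorder so negative lags precede positive lags around index max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                     # delay in seconds
```

The PHAT weighting whitens the cross-spectrum so that the correlation peak depends only on phase, which is what makes these features robust inputs for the localization networks surveyed above.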