STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Politis, Archontis; Shimada, Kazuaki; Sudarsanam, Parthasaarathy; Adavanne, Sharath; Krause, Daniel; Koyama, Yuichiro; Takahashi, Shun; Mitsufuji, Yuki; Virtanen, Tuomas

doi:10.48550/arxiv.2206.01948

Cited by 4 publications

(8 citation statements)

References 0 publications

“…We adopted the development sets of DCASE Task 3 from 2020 to 2022 [14][15][16] to compare the proposed method with other SELD approaches [6,11,12]. Each includes 14, 12, and 13 sound event classes respectively, which are loosely shared.…”

Section: Experimental Setups (mentioning)

confidence: 99%

“…We adapt the framework of "You Only Look Once" (YOLO) [13], renowned for multiple object detection from images, to the SELD by using the notion of angular distance, namely proposing angular-distance-based YOLO (AD-YOLO). The results of an experiment using the series of DCASE 2020-2022 Task 3 (SELD) datasets [14][15][16] demonstrated that AD-YOLO outperformed existing SELD formats in both overall evaluation and polyphonic circumstances.…”

Section: Introduction (mentioning)

confidence: 99%

AD-YOLO: You Look Only Once in Training Multiple Sound Event Localization and Detection

Kim, Park, Shin et al. 2023

Preprint

Sound event localization and detection (SELD) combines the identification of sound events with the corresponding directions of arrival (DOA). Recently, event-oriented track output formats have been adopted to solve this problem; however, they still have limited generalization toward real-world problems in an unknown polyphony environment. To address the issue, we proposed an angular-distance-based multiple SELD (AD-YOLO), which is an adaptation of the "You Only Look Once" algorithm for SELD. The AD-YOLO format allows the model to learn sound occurrences location-sensitively by assigning class responsibility to DOA predictions. Hence, the format enables the model to handle the polyphony problem, regardless of the number of sound overlaps. We evaluated AD-YOLO on DCASE 2020-2022 challenge Task 3 datasets using four SELD objective metrics. The experimental results show that AD-YOLO achieved outstanding performance overall and also accomplished robustness in class-homogeneous polyphony environments.
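
The abstract above relies on the notion of angular distance between directions of arrival. As a minimal sketch of that quantity only, and not the AD-YOLO authors' implementation, the snippet below computes the great-circle angle between two DOAs. The function names and the azimuth/elevation-in-degrees convention are assumptions made for illustration.

```python
import math


def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert a DOA given as (azimuth, elevation) in degrees to a 3-D unit vector."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


def angular_distance_deg(doa_a, doa_b):
    """Angle in degrees between two DOAs, each an (azimuth, elevation) pair."""
    va = doa_to_unit_vector(*doa_a)
    vb = doa_to_unit_vector(*doa_b)
    dot = sum(x * y for x, y in zip(va, vb))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))
```

A quantity of this kind could be used to decide whether a predicted DOA lies close enough to a reference DOA; any threshold for that decision would be a separate design choice that the abstract does not specify.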

“…DCASE 2019–2021 were conducted using the synthesized data. However, DCASE 2022 Task 3 differs from previous competitions in that it includes a relatively small amount of real spatial acoustic scene data and a relatively large amount of synthetic data generated using specific indoor impulse responses [17]. There is a difference with the dataset.…”

Section: Introduction (mentioning)

confidence: 99%

Sound Event Localization and Detection Using Imbalanced Real and Synthetic Data via Multi-Generator

Shin, Chun 2023

Sensors

This study proposes a sound event localization and detection (SELD) method using imbalanced real and synthetic data via a multi-generator. The proposed method is based on a residual convolutional neural network (RCNN) and a transformer encoder for real spatial sound scenes. SELD aims to classify the sound event, detect the onset and offset of the classified event, and estimate the direction of the sound event. In Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Task 3, SELD is performed with a few real spatial sound scene data and a relatively large number of synthetic data. When a model is trained using imbalanced data, it can proceed by focusing only on a larger number of data. Thus, a multi-generator that samples real and synthetic data at a specific rate in one batch is proposed to prevent this problem. We applied the data augmentation technique SpecAugment and used time-frequency masking to the dataset. Furthermore, we propose a neural network architecture to apply the RCNN and transformer encoder. Several models were trained with various structures and hyperparameters, and several ensemble models were obtained by “cherry-picking” specific models. Based on the experiment, the single model of the proposed method and the model applied with the ensemble exhibited improved performance compared with the baseline model.
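
The abstract above describes a multi-generator that samples real and synthetic data at a specific rate within one batch. The following is a minimal sketch of that idea under stated assumptions, not the authors' code: the function name `mixed_batches`, the `real_fraction` parameter, and sampling with replacement are all hypothetical choices made for illustration.

```python
import random


def mixed_batches(real_items, synth_items, batch_size, real_fraction,
                  num_batches, seed=0):
    """Yield batches holding a fixed share of real and synthetic items.

    real_fraction is the (assumed) share of each batch drawn from real data.
    Items are drawn with replacement so a small real set can still fill its quota.
    """
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)
    n_synth = batch_size - n_real
    for _ in range(num_batches):
        batch = (rng.choices(real_items, k=n_real)
                 + rng.choices(synth_items, k=n_synth))
        rng.shuffle(batch)  # avoid a fixed real-then-synthetic ordering
        yield batch


# Usage: 5 real clips against 100 synthetic clips, 25% real per batch.
real = [("real", i) for i in range(5)]
synth = [("synth", i) for i in range(100)]
batches = list(mixed_batches(real, synth, batch_size=8,
                             real_fraction=0.25, num_batches=3))
```

Fixing the per-batch share keeps the scarce real recordings present in every training step, which is the imbalance concern the abstract raises; the actual sampling rate used in the paper is not given in this excerpt.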

“…Especially, labeling DOAs cannot be achieved using audio input alone but requires additional inputs such as optical tracking data and 360° videos. A new real-recorded SELD dataset has been released for the DCASE2022 SELD Challenge [65]. However, the size of this dataset is still small, with a total recording length of around 7 hours.…”

Section: Challenges in SELD (mentioning)

confidence: 99%

“…More importantly, multichannel audio data are dependent on the array geometry and cannot easily be shared among different applications. Examples of some publicly available datasets for multichannel SED are TUT Sound Events 2016 [77], TAU-NIGENS Spatial Sound Events 2021 [48], and Sony-TAU Realistic Spatial Soundscapes 2022 [65]. One method to improve the multichannel SED performance is transfer learning from single-channel SED models [78].…”

Section: Network Architecture and Datasets (mentioning)

confidence: 99%

Input features for deep learning-based polyphonic sound event localization and detection

Nguyen

The contributions of the co-authors are as follows:
• Prof Gan provided the initial project direction, checked the milestones, discussed the experimental results with me, guided the manuscript preparation, and edited the manuscript drafts.
• Prof Jones contributed to the code and discussed the experimental results with me.
• Dr Ranjan gave suggestions about the network architecture.
• I collected data, developed the code, performed the experiments, analyzed the experimental results, and prepared the manuscript drafts.
Chapter 4 is published as T.
