How Information on Acoustic Scenes and Sound Events Mutually Benefits Event Detection and Scene Classification Tasks

Igarashi, Ami; Imoto, Keisuke; Komatsu, Yuka; Tsubaki, Shunsuke; Hario, Shuto; Komatsu, Tatsuya

doi:10.23919/apsipaasc55919.2022.9979926

Cited by 3 publications

(1 citation statement)

References 13 publications

“…Sound event detection (SED) aims to temporally localize sound events of interest (i.e., the start and end time) and recognize their class labels in a long audio stream (Mesaros et al 2021). As a fundamental audio signal processing task, it has become the cornerstone of many related recognition scenarios, such as audio captioning (Xu et al 2021; Bhosale, Chakraborty, and Kopparapu 2023; Xie et al 2023), and acoustic scene understanding (Igarashi et al 2022; Bear, Nolasco, and Benetos 2019).…”

Section: Introduction (classified as mentioning)

confidence: 99%

DiffSED: Sound Event Detection with Denoising Diffusion

Bhosale, Nag, Kanojia et al. 2024

AAAI

Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions in the elegant Transformer decoder framework. Doing so enables the model to generate accurate event boundaries from even noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with 40+% faster convergence in training. Code: https://github.com/Surrey-UPLab/DiffSED
