2021
DOI: 10.48550/arxiv.2110.02011
Preprint

Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection

Abstract: Sound event detection (SED) has gained increasing attention given its wide range of applications in surveillance, video indexing, etc. Existing SED models mainly generate frame-level predictions, turning the task into a sequence multi-label classification problem. This inevitably brings a trade-off between event boundary detection and audio tagging when weakly labeled data are used to train the model. Besides, such models require post-processing and cannot be trained end-to-end. This paper first presents the 1D Detection …
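To make the trade-off concrete, below is a minimal sketch of the conventional frame-level pipeline the abstract critiques: per-frame probabilities are thresholded, median-filtered, and merged into events. The threshold, filter length, and hop size are illustrative hyper-parameters, not values from the paper.

```python
import numpy as np
from scipy.ndimage import median_filter

def frame_probs_to_events(probs, threshold=0.5, filt_len=7, hop_s=0.02):
    """Turn per-frame probabilities (T,) for one class into (onset, offset)
    pairs in seconds by thresholding, median filtering, and edge finding."""
    active = (probs > threshold).astype(int)         # frame-level decisions
    smoothed = median_filter(active, size=filt_len)  # per-class smoothing knob
    padded = np.concatenate(([0], smoothed, [0]))
    edges = np.diff(padded)                          # +1 = onset, -1 = offset
    onsets = np.where(edges == 1)[0]
    offsets = np.where(edges == -1)[0]
    return [(on * hop_s, off * hop_s) for on, off in zip(onsets, offsets)]

# Toy activation: a 30-frame event with low-level noise around it.
rng = np.random.default_rng(0)
probs = np.concatenate([np.zeros(20), 0.8 * np.ones(30), np.zeros(10)])
probs = np.clip(probs + 0.1 * rng.random(60), 0.0, 1.0)
print(frame_probs_to_events(probs))  # e.g. [(0.4, 1.0)]
```

It is exactly these per-class post-processing knobs that an event-based, end-to-end model aims to remove.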

Cited by 5 publications (13 citation statements)
References 45 publications
“…The proposed SP-SEDT is pre-trained on the DCASE2018 task5 development dataset (72,984 clips) and the unlabeled subset (14,412 clips) of the DCASE2019 task4 dataset. The weakly-labeled subset and the synthetic subset with strong labels (2,045 clips) of the DCASE2019 task4 dataset are used to fine-tune the pre-trained model, following the experimental setup of [7]. The DCASE2019 task4 dataset is the same as the DCASE2021 task4 dataset except for the synthetic subset.…”
Section: Methods (mentioning)
confidence: 99%
“…Event-level loss: The matching between target and predicted events can be obtained via the Hungarian algorithm and a one-to-many strategy [7]. For each event, the loss function comprises a location loss and a classification loss:…”
Section: Loss Function (mentioning)
confidence: 99%
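The statement above summarizes the matching-based loss. The following is a minimal sketch of how such a Hungarian-matched event-level loss can be computed, assuming DETR-style normalized (center, width) boxes and softmax class scores; the weights `lam_loc`/`lam_cls` and the one-to-one matching are illustrative simplifications, whereas [7] additionally duplicates targets for its one-to-many strategy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def event_loss(pred_boxes, pred_logits, tgt_boxes, tgt_labels,
               lam_loc=1.0, lam_cls=1.0):
    """One-to-one Hungarian matching between predicted events (center, width)
    and targets, followed by a summed L1 location + NLL classification loss.
    lam_loc/lam_cls are illustrative weights, not the paper's values."""
    probs = np.exp(pred_logits)
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over classes
    # Cost matrix: rows = predicted events, cols = target events.
    cost_loc = np.abs(pred_boxes[:, None, :] - tgt_boxes[None, :, :]).sum(-1)
    cost_cls = -probs[:, tgt_labels]                 # high prob -> low cost
    rows, cols = linear_sum_assignment(lam_loc * cost_loc + lam_cls * cost_cls)
    loc = np.abs(pred_boxes[rows] - tgt_boxes[cols]).sum()
    cls = -np.log(probs[rows, tgt_labels[cols]] + 1e-9).sum()
    return lam_loc * loc + lam_cls * cls

pred_b = np.array([[0.30, 0.10], [0.70, 0.20]])      # (center, width) in [0, 1]
pred_l = np.array([[2.0, 0.1], [0.1, 2.0]])          # logits for 2 classes
tgt_b, tgt_y = np.array([[0.32, 0.12]]), np.array([0])
print(event_loss(pred_b, pred_l, tgt_b, tgt_y))
```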
“…Low-level feature extraction involves generating basic features and cross-scale path embedding to enhance fine-grained details in video frame interpolation [27]. These encompass color features (such as histograms, color moments, and color spaces like RGB, HSV, and LAB), temporal features (such as texture and STFT), shape features (such as object trajectory and silhouette), and motion features (such as motion vectors, frame differencing, and optical flow) [28].…”
Section: B. Low-level Features Extraction (mentioning)
confidence: 99%
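As a concrete illustration of two of the feature families this statement lists, here is a minimal sketch computing a per-channel color histogram and a frame-differencing motion score. The bin count, grayscale conversion, and toy frames are illustrative assumptions, not choices taken from [28].

```python
import numpy as np

def color_histogram(frame, bins=16):
    """Per-channel color histogram of an RGB frame (H, W, 3); the bin
    count is an illustrative choice."""
    return np.concatenate([
        np.histogram(frame[..., c], bins=bins, range=(0, 255), density=True)[0]
        for c in range(3)])

def frame_difference(prev_frame, frame):
    """Crude motion feature: mean absolute per-pixel difference between
    consecutive grayscale frames (frame differencing)."""
    to_gray = lambda f: f.mean(axis=-1)
    return np.abs(to_gray(frame) - to_gray(prev_frame)).mean()

# Toy frames standing in for decoded video.
f0 = np.random.randint(0, 256, (120, 160, 3)).astype(float)
f1 = np.roll(f0, 5, axis=1)   # simulated horizontal motion
print(color_histogram(f1).shape, frame_difference(f0, f1))
```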
“…They are often heavily hand-designed, with many heuristics and data-specific parameter optimization, and are hence less scalable and less reliable across different audio data. Event-level approaches, on the other hand, directly model the temporal boundaries of sound events and take into account the correlation between frames, thereby eliminating the mundane post-processing step and generalizing better (Ye et al. 2021). In both approaches, existing methods rely on proposal prediction, regressing the start and end times of each event, i.e., they are discriminative-learning based.…”
Section: Introduction (mentioning)
confidence: 99%
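The "proposal prediction by regressing start and end times" that this statement describes can be sketched as a generic DETR-style query head; the embedding size, query count, class count, and (center, width) parametrization below are illustrative assumptions, not the exact SEDT design.

```python
import torch
import torch.nn as nn

class EventProposalHead(nn.Module):
    """Generic discriminative proposal head: from one decoder embedding per
    event query, regress a normalized (center, width) box and class logits."""
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 2))  # (center, width)
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for "no event"

    def forward(self, queries):          # (batch, num_queries, dim)
        boxes = self.box_head(queries).sigmoid()          # keep in [0, 1]
        return boxes, self.cls_head(queries)

head = EventProposalHead()
boxes, logits = head(torch.randn(1, 20, 256))
onsets = boxes[..., 0] - boxes[..., 1] / 2   # convert to (onset, offset)
offsets = boxes[..., 0] + boxes[..., 1] / 2
```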