2020
DOI: 10.1109/taslp.2020.2989575

Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection

Abstract: Sound event detection (SED) is to recognize the presence of sound events in the segment of audio and detect their onset as well as offset. SED can be regarded as a supervised learning task when strong annotations (timestamps) are available during learning. However, due to the high cost of manual strong labeling data, it becomes crucial to introduce weakly supervised learning to SED, in which only weak annotations (clip-level annotations without timestamps) are available during learning. In this paper, we approa…

Cited by 29 publications (29 citation statements)
References 42 publications
“…Wang [8] modified connectionist temporal classification (CTC) [9] in order to enable capturing long and short events effectively. An essential piece of work done in [10], analyzed the usage of attention-based high-level feature disentangling. Other studies in [11] analyzed the impact of several post-processing methods, specifically the estimation of an event-dependent threshold, in regards to WSSED.…”
Section: Introduction (mentioning)
confidence: 99%
“…Reduction rate r in the event-aware module is set to 8. For backend processing, adaptive median filter [11] is used, where the median filter size for each event is set to 1/3 of its mean duration length. For consistency, all experiments are repeated 5 times.…”
Section: Datasets and Setup (mentioning)
confidence: 99%
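
The statement above describes a per-event median filter whose size is one third of the event's mean duration. A minimal Python sketch of that idea follows. It assumes frame-level predictions that are already binarized, durations measured in frames, rounding of the window to an odd length, and zero padding at the clip edges; none of these details are given in the quoted text, and all function names are hypothetical.

```python
from statistics import median


def window_size(mean_duration_frames):
    """Odd window length near 1/3 of the mean event duration (assumed rounding)."""
    size = max(1, round(mean_duration_frames / 3))
    return size if size % 2 == 1 else size + 1


def median_filter(frames, size):
    """Median-filter a 0/1 frame sequence; edges are zero-padded (assumption)."""
    half = size // 2
    padded = [0] * half + list(frames) + [0] * half
    # Each window has odd length, so the median is always 0 or 1.
    return [int(median(padded[i:i + size])) for i in range(len(frames))]


def adaptive_median_filter(predictions, mean_durations):
    """Smooth each event class with its own duration-dependent window."""
    return {
        event: median_filter(frames, window_size(mean_durations[event]))
        for event, frames in predictions.items()
    }


# Hypothetical example: one event class with a mean duration of 9 frames.
predictions = {"speech": [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]}
smoothed = adaptive_median_filter(predictions, {"speech": 9})
```

With a 3-frame window, the isolated active frame is removed and the one-frame gap inside the longer event is filled, so a long-duration class receives more smoothing than a short one.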
“…However, existing models generally have fixed receptive fields, which we believe are not well suited to capture the large variability in inter- and intra-event class durations, shown in Table 1. To solve this issue, post-processing methods, such as adaptive median filter [11] and double thresholding [12] were proposed. But these hand-tailored methods need prior event knowledge, which may be difficult to estimate.…”
Section: Introduction (mentioning)
confidence: 99%
“…There are two main MIL strategies, i.e. instance-level approach [9] and embedding-level approach [10,11]. The embedding-level approach integrates the instance-level feature representations into a bag-level contextual representation and then directly carries out bag-level classification, which shows a better performance [12].…”
Section: Introduction (mentioning)
confidence: 99%
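
The difference between the two MIL strategies named above can be illustrated with a small Python sketch. It assumes a single event class, a linear classifier with sigmoid output, max pooling for the instance-level approach, and mean pooling for the embedding-level approach; the quoted statement does not specify any of these choices, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def instance_level(features, w, b):
    """Classify every frame (instance), then pool the frame probabilities."""
    frame_probs = [sigmoid(dot(w, x) + b) for x in features]
    return max(frame_probs), frame_probs  # max pooling is an assumption


def embedding_level(features, w, b):
    """Pool frame features into one bag embedding, then classify once."""
    n = len(features)
    bag = [sum(column) / n for column in zip(*features)]  # mean pooling (assumption)
    return sigmoid(dot(w, bag) + b), bag


# Hypothetical clip with two frames of 2-dimensional features.
features = [[0.0, 0.0], [4.0, 0.0]]
w, b = [1.0, 0.0], -2.0
clip_prob_instance, frame_probs = instance_level(features, w, b)
clip_prob_embedding, bag = embedding_level(features, w, b)
```

The instance-level path yields frame probabilities as a by-product, whereas the embedding-level path only yields a clip-level decision from the pooled representation.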