Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection

Lin, Liwei; Wang, Xiangdong; Liu, Hong; Ye, Qian

doi:10.1109/taslp.2020.2989575

Cited by 29 publications

(29 citation statements)

References 42 publications


“…Wang [8] modified connectionist temporal classification (CTC) [9] in order to enable capturing long and short events effectively. An essential piece of work done in [10], analyzed the usage of attention-based high-level feature disentangling. Other studies in [11] analyzed the impact of several post-processing methods, specifically the estimation of an event-dependent threshold, in regards to WSSED.…”

Section: Introduction (mentioning)

confidence: 99%

Duration Robust Weakly Supervised Sound Event Detection

Dinkel

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Task 4 of the DCASE2018 challenge demonstrated that substantially more research is needed for a real-world application of sound event detection. Analyzing the challenge results it can be seen that most successful models are biased towards predicting long (e.g., over 5s) clips. This work aims to investigate the performance impact of fixed-sized window median filter post-processing and advocate the use of double thresholding as a more robust and predictable post-processing method. Further, four different temporal subsampling methods within the CRNN framework are proposed: mean-max, α-mean-max, Lp-norm and convolutional. We show that for this task subsampling the temporal resolution by a neural network enhances the F1 score as well as its robustness towards short, sporadic sound events. Our best single model achieves 30.1% F1 on the evaluation set and the best fusion model 32.5%, while being robust to event length variations.
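The double thresholding mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure or threshold values, so the function name `double_threshold`, the defaults `hi=0.75` and `lo=0.2`, and the hysteresis-style rule below (keep a run of frames above the low threshold only if it contains a frame above the high threshold) are assumptions for illustration, not the authors' implementation.

```python
def double_threshold(probs, hi=0.75, lo=0.2):
    """Return a 0/1 decision per frame using hysteresis-style double thresholding.

    hi and lo are illustrative values, not taken from the cited paper.
    """
    n = len(probs)
    active = [0] * n
    i = 0
    while i < n:
        if probs[i] > lo:
            # Find the end of the contiguous run of frames above the low threshold.
            j = i
            while j < n and probs[j] > lo:
                j += 1
            # Keep the whole run only if at least one frame exceeds the high threshold.
            if any(p > hi for p in probs[i:j]):
                for k in range(i, j):
                    active[k] = 1
            i = j
        else:
            i += 1
    return active


if __name__ == "__main__":
    frame_probs = [0.1, 0.3, 0.8, 0.4, 0.1, 0.3, 0.25, 0.1]
    print(double_threshold(frame_probs))
```

In this sketch the first run (frames 1 to 3) is kept because it contains a confident frame, while the second run (frames 5 and 6) is discarded because it never exceeds the high threshold.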


“…Reduction rate r in the event-aware module is set to 8. For backend processing, adaptive median filter [11] is used, where the median filter size for each event is set to 1/3 of its mean duration length. For consistency, all experiments are repeated 5 times.…”

Section: Datasets and Setup (mentioning)

confidence: 99%
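The adaptive median filter setting quoted above (window size equal to 1/3 of each event's mean duration) can be sketched as follows. The frame rate, the rounding to an odd window, the zero padding at the edges, and the helper names `window_size` and `median_filter` are assumptions made for this illustration; the quoted statement only specifies the 1/3 ratio.

```python
def window_size(mean_duration_s, frames_per_second, fraction=1 / 3):
    """Median filter length in frames: a fraction of the event's mean duration.

    The result is forced to be odd so the window has a well-defined centre
    (an assumption of this sketch).
    """
    size = max(1, round(mean_duration_s * frames_per_second * fraction))
    return size if size % 2 == 1 else size + 1


def median_filter(decisions, size):
    """Median-filter a 0/1 frame sequence with an odd window, zero-padded at the edges."""
    half = size // 2
    padded = [0] * half + list(decisions) + [0] * half
    # sorted(window)[half] is the median because the window length is odd.
    return [sorted(padded[i:i + size])[half] for i in range(len(decisions))]


if __name__ == "__main__":
    # Hypothetical event class with a mean duration of 0.9 s at 10 frames per second.
    size = window_size(0.9, 10)
    print(size, median_filter([0, 1, 0, 1, 1, 1, 0, 0, 1, 0], size))
```

With a per-class window, short isolated activations are smoothed away for long events while short events keep a small window that does not erase them.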

“…However, existing models generally have fixed receptive fields, which we believe are not well suited to capture the large variability in inter- and intra-event class durations, shown in Table 1. To solve this issue, post-processing methods, such as adaptive median filter [11] and double thresholding [12] were proposed. But these hand-tailored methods need prior event knowledge, which may be difficult to estimate.…”

Section: Introduction (mentioning)

confidence: 99%


An Improved Mean Teacher Based Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection

Zheng, Yan, McLoughlin, et al.

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


This paper presents an improved mean teacher (MT) based method for large-scale weakly labeled semi-supervised sound event detection (SED), by focusing on learning a better student model. Two main improvements are proposed based on the authors' previous perturbation based MT method. Firstly, an event-aware module is designed to allow multiple branches with different kernel sizes to be fused via an attention mechanism. By inserting this module after the convolutional layer, each neuron can adaptively adjust its receptive field to suit different sound events. Secondly, instead of using the teacher model to provide a consistency cost term, we propose using a stochastic inference of unlabeled examples to generate high quality pseudo-targets by averaging multiple predictions from the perturbed student model. MixUp of both labeled and unlabeled data is further exploited to improve the effectiveness of student model. Finally, the teacher model can be obtained via exponential moving average (EMA) of the student model, which generates final predictions for SED during inference. Experiments on the DCASE2018 task4 dataset demonstrate the ability of the proposed method. Specifically, an F1-score of 42.1% is achieved, significantly outperforming the 32.4% achieved by the winning system, or the 39.3% by the previous perturbation based method.
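The exponential moving average (EMA) step that derives the teacher from the student can be sketched in a few lines. Parameters are represented here as a plain dictionary of floats, and the decay value 0.999 and the name `ema_update` are assumptions for illustration; the abstract does not state the decay used.

```python
def ema_update(teacher, student, decay=0.999):
    """Update teacher parameters in place as an EMA of the student parameters.

    teacher and student map parameter names to floats; decay is an
    illustrative value, not taken from the cited paper.
    """
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher


if __name__ == "__main__":
    teacher = {"w": 1.0, "b": 0.0}
    student = {"w": 0.0, "b": 1.0}
    # One update with decay 0.9 moves the teacher 10% of the way to the student.
    print(ema_update(teacher, student, decay=0.9))
```

Calling this after every training step makes the teacher a smoothed copy of the student, which is the model the abstract says is used for the final predictions at inference.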


“…There are two main MIL strategies, i.e. instance-level approach [9] and embedding-level approach [10,11]. The embedding-level approach integrates the instance-level feature representations into a bag-level contextual representation and then directly carries out bag-level classification, which shows a better performance [12].…”

Section: Introduction (mentioning)

confidence: 99%
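The difference between the two MIL strategies in the statement above can be made concrete with a toy sketch. The linear-plus-sigmoid classifier, max pooling for the instance-level approach, mean pooling for the embedding-level approach, and all names and weights below are hypothetical choices for illustration; the cited works use learned networks and other pooling functions.

```python
import math


def classify(feature, weights, bias):
    """Toy classifier: a linear layer followed by a sigmoid."""
    score = sum(f * w for f, w in zip(feature, weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def instance_level(frames, weights, bias):
    """Classify every instance (frame), then pool the probabilities (here: max)."""
    return max(classify(frame, weights, bias) for frame in frames)


def embedding_level(frames, weights, bias):
    """Pool instance features into one bag-level embedding (here: mean), then classify once."""
    n = len(frames)
    bag = [sum(column) / n for column in zip(*frames)]
    return classify(bag, weights, bias)


if __name__ == "__main__":
    frames = [[0.0, 0.0], [2.0, 0.0]]  # two frames, two features each
    weights, bias = [1.0, 0.0], 0.0
    print(instance_level(frames, weights, bias))
    print(embedding_level(frames, weights, bias))
```

The instance-level path yields per-frame probabilities as a by-product, while the embedding-level path makes a single bag-level decision from the aggregated representation.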

A Global-Local Attention Framework for Weakly Labelled Audio Tagging

Wang, Zou, Wang

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Weakly labelled audio tagging aims to predict the classes of sound events within an audio clip, where the onset and offset times of the sound events are not provided. Previous works have used the multiple instance learning (MIL) framework, and exploited the information of the whole audio clip by MIL pooling functions. However, the detailed information of sound events such as their durations may not be considered under this framework. To address this issue, we propose a novel two-stream framework for audio tagging by exploiting the global and local information of sound events. The global stream aims to analyze the whole audio clip in order to capture the local clips that need to be attended using a class-wise selection module. These clips are then fed to the local stream to exploit the detailed information for a better decision. Experimental results on the AudioSet show that our proposed method can significantly improve the performance of audio tagging under different baseline network architectures.
