Weakly Labelled AudioSet Tagging With Attention Neural Networks

Kong, Qiuqiang; Yu, Changsong; Iqbal, Turab; Xu, Yong; Wang, Wenwu; Plumbley, Mark D.

doi:10.1109/taslp.2019.2930913

Cited by 80 publications

(68 citation statements)

References 34 publications

“…Our best system achieves an AP of 0.904 for speech event detection and 0.898 for music event detection, outperforming the results reported in [9] or [13] for these specific event categories. It should be highlighted that, in contrast with our system, the cited works target every event category in the AudioSet ontology and use the whole AudioSet training sets, whereas we only consider two target categories.…”

Section: Average Precision (mentioning)

confidence: 55%

“…The ontology and the dataset defined in Google AudioSet have already been used to carry out several works and evaluations, such as the last editions of the DCASE challenge [7]. The size of this dataset in both the number of utterances and the diversity of audio events draws a new paradigm for the development of machine learning-based AED systems, where some research has been already performed [10][11][12][13].…”

Section: Introduction (mentioning)

confidence: 99%

Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

Benito-Gorrón, Lozano-Díez, Toledano, et al. 2019. J AUDIO SPEECH MUSIC PROC.

Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work aims to study the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The first one is the training of two different neural networks, one for speech detection and another for music detection. The second approach consists of training a single neural network to tackle both tasks at the same time. The studied architectures include fully connected, convolutional and LSTM (long short-term memory) recurrent networks. Comparative results are provided in terms of classification performance and model complexity. We would like to highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore, a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most harmful for the performance of the models, showing some difficult scenarios for the detection of music and speech.

“…This manifests that attention modules at different levels are complementary and benefit each other. We also list the results on Audioset, which were reported by Google's benchmark [1] and the state-of-the-art methods [10,11,16,21]. Our 3-layers AT-SCA outperforms remarkably the listed methods in mAP and achieves a similar result in mAUC.…”

Section: Network Architecture (mentioning)

confidence: 91%

“…Evidently, this paradigm is more suitable for learning weak labels. In this case, such as the state-of-the-art CNN-based MIL methods [10,11], CNNs serve as the feature extractor to learn representations for instances which are integrated into bag-level. Starting from an input spectrogram of size W × H × 1, the convolutional layer consisting of C-channel filters outputs a W × H × C feature map, which will be fed to the next convolutional layer to extract frequency-shift invariant features.…”

Section: Introduction (mentioning)

confidence: 99%
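The shape bookkeeping in the statement above (a W × H × 1 spectrogram mapped to a W × H × C feature map) can be illustrated with a minimal pure-Python sketch. It assumes stride 1 and zero ("same") padding, which is what preserves the W × H extent; the function name `conv2d_same` and the two example kernels are hypothetical and are not taken from the cited papers.

```python
def conv2d_same(image, kernels):
    """Cross-correlate a single-channel W x H image with C square kernels.

    Assumes odd kernel size, stride 1, and zero padding, so the output
    is a W x H x C nested list (one channel per kernel).
    """
    width, height = len(image), len(image[0])
    k = len(kernels[0])
    pad = k // 2
    out = [[[0.0] * len(kernels) for _ in range(height)] for _ in range(width)]
    for c, kernel in enumerate(kernels):
        for i in range(width):
            for j in range(height):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        x, y = i + di - pad, j + dj - pad
                        # Positions outside the image contribute zero.
                        if 0 <= x < width and 0 <= y < height:
                            acc += image[x][y] * kernel[di][dj]
                out[i][j][c] = acc
    return out


# Toy 4 x 3 "spectrogram" and C = 2 illustrative filters.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
mean3x3 = [[1 / 9] * 3 for _ in range(3)]
fmap = conv2d_same(image, [identity, mean3x3])
shape = (len(fmap), len(fmap[0]), len(fmap[0][0]))
```

Here `shape` has the form (W, H, C), with one output channel per filter. In practice a learned convolutional layer from a deep learning framework would replace the hand-written loops.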

Weakly Labelled Audio Tagging Via Convolutional Networks with Spatial and Channel-Wise Attention

Hong, Zou, Wang, et al. 2020. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

“…The embedding-level approaches add the pooling layer to the high-level representation generated by neural networks to obtain a contextual representation, and the clip-level probability is further acquired [15]. As mentioned in [14] and [17], the embedding-level approaches are preferable in terms of the clip-level performance, which is demonstrated on Audioset [18], [19].…”

Section: Introduction (mentioning)

confidence: 99%
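As a rough illustration of the embedding-level approach described in the statement above, the following sketch pools a sequence of frame-level feature vectors into one contextual vector with softmax attention weights and then maps that vector to a clip-level probability. It is a generic attention-pooling sketch under stated assumptions, not the method of any cited paper: the names `attention_pool` and `clip_probability`, the single attention vector `attn_w`, and the linear classifier (`clf_w`, `clf_b`) are all hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(frames, attn_w):
    """Pool T frame embeddings (T x D) into one D-dimensional context vector.

    Each frame gets a scalar score (dot product with attn_w); the scores
    are normalised over time with softmax and used as averaging weights.
    """
    weights = softmax([dot(f, attn_w) for f in frames])
    dim = len(frames[0])
    context = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return context, weights


def clip_probability(frames, attn_w, clf_w, clf_b):
    """Clip-level probability for one class from the pooled embedding."""
    context, weights = attention_pool(frames, attn_w)
    return sigmoid(dot(context, clf_w) + clf_b), weights


# Three frames of 2-dimensional embeddings; the second frame scores highest.
frames = [[0.1, 0.0], [2.0, 0.5], [0.0, 0.2]]
prob, weights = clip_probability(frames, attn_w=[1.0, 0.0], clf_w=[1.5, -0.5], clf_b=-1.0)
```

Only the clip-level output `prob` needs a label during training, which is why this kind of pooling suits weak annotations; the per-frame `weights` are a by-product of the pooling step. A class-wise variant would, under the same assumptions, use one attention vector per class.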

Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection

Lin, Wang, Liu, et al. 2020. IEEE/ACM Trans. Audio Speech Lang. Process.

Sound event detection (SED) is to recognize the presence of sound events in the segment of audio and detect their onset as well as offset. SED can be regarded as a supervised learning task when strong annotations (timestamps) are available during learning. However, due to the high cost of manual strong labeling data, it becomes crucial to introduce weakly supervised learning to SED, in which only weak annotations (clip-level annotations without timestamps) are available during learning. In this paper, we approach SED as a multiple instance learning (MIL) problem and utilize a neural network framework with an embedding-level pooling module to solve it. The pooling module, which aggregates a sequence of high-level features generated by the neural network feature encoder into a single contextual feature representation, enables the model to learn with only weak annotations. We explore the self-learning ability of different pooling modules on finer information and propose a specialized decision surface (SDS) for class-wise attention pooling (cATP) module. We analyze and explain why a cATP module with SDS is better than other typical pooling modules from the perspective of feature space. According to the co-occurrence of several categories in the multi-label classification task, we also propose a disentangled feature (DF) to reduce interference between categories, which optimizes the high-level feature space by disentangling it based on class-wise identifiable information in the training set and obtaining multiple different subspaces. Experiments show that our approach achieves state-of-the-art performance on Task4 of the DCASE2018 challenge. On this basis,
