2019
DOI: 10.1109/taslp.2019.2930913
Weakly Labelled AudioSet Tagging With Attention Neural Networks

Abstract: Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio tagging focused on relatively small datasets limited to recognising a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classes. AudioSet is weakly labelled, in that only the presence or absence of sound classes is known for each clip, while the onset and offset times are unknown. To address the w…
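The attention approach the abstract alludes to can be illustrated with a minimal sketch: a clip is split into segments, each segment gets per-class probabilities and per-class attention scores, and the clip-level prediction is the attention-weighted average. The function and variable names below are hypothetical, not taken from the paper's code.

```python
import numpy as np

def attention_pooling(seg_probs, att_logits):
    """Aggregate per-segment predictions into one clip-level prediction.

    seg_probs:  (T, C) per-segment class probabilities.
    att_logits: (T, C) unnormalised attention scores.
    Returns a (C,) array of clip-level probabilities.
    """
    # Softmax over the time axis turns scores into attention weights
    # that sum to 1 for each class (shifted by the max for stability).
    w = np.exp(att_logits - att_logits.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    # The attention-weighted mean of segment probabilities is a convex
    # combination, so each clip-level value stays within [0, 1].
    return (w * seg_probs).sum(axis=0)

# Toy example: 4 segments, 3 sound classes.
rng = np.random.default_rng(0)
probs = rng.uniform(size=(4, 3))
logits = rng.normal(size=(4, 3))
clip = attention_pooling(probs, logits)
```

Because the weights sum to one per class, each clip-level probability is bounded by the minimum and maximum segment probability for that class.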


Cited by 80 publications (68 citation statements)
References 34 publications
“…Our best system achieves an AP of 0.904 for speech event detection and 0.898 for music event detection, outperforming the results reported in [9] or [13] for these specific event categories. It should be highlighted that, in contrast with our system, the cited works target every event category in the AudioSet ontology and use the whole AudioSet training sets, whereas we only consider two target categories.…”
Section: Average Precision
Mentioning confidence: 55%
“…The ontology and the dataset defined in Google AudioSet have already been used to carry out several works and evaluations, such as the last editions of the DCASE challenge [7]. The size of this dataset in both the number of utterances and the diversity of audio events draws a new paradigm for the development of machine learning-based AED systems, where some research has been already performed [10][11][12][13].…”
Section: Introduction
Mentioning confidence: 99%
“…This manifests that attention modules at different levels are complementary and benefit each other. We also list the results on Audioset, which were reported by Google's benchmark [1] and the state-of-the-art methods [10,11,16,21]. Our 3-layers AT-SCA outperforms remarkably the listed methods in mAP and achieves a similar result in mAUC.…”
Section: Network Architecture
Mentioning confidence: 91%
“…Evidently, this paradigm is more suitable for learning weak labels. In this case, such as the state-of-the-art CNN-based MIL methods [10,11], CNNs serve as the feature extractor to learn representations for instances which are integrated into bag-level. Starting from an input spectrogram of size W × H × 1, the convolutional layer consisting of C-channel filters outputs a W × H × C feature map, which will be fed to the next convolutional layer to extract frequency-shift invariant features.…”
Section: Introduction
Mentioning confidence: 99%
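The multiple-instance-learning view described in the quote above — a CNN extracts per-segment (instance) representations that are then integrated into a bag-level prediction — can be sketched with the two classic pooling rules. This is an illustrative sketch, not the cited papers' implementation.

```python
import numpy as np

def mil_pool(instance_probs, mode="max"):
    """Combine instance-level probabilities (T, C) into a bag-level
    prediction (C,) under the MIL assumption: a class is present in
    the bag if it is present in at least one instance."""
    if mode == "max":
        # The strongest instance decides; suits short, sparse events.
        return instance_probs.max(axis=0)
    if mode == "mean":
        # Smoother gradients, but dilutes events active in few frames.
        return instance_probs.mean(axis=0)
    raise ValueError(f"unknown mode: {mode}")

# Three 1-second instances, two classes (e.g. speech, dog bark).
inst = np.array([[0.1, 0.9],
                 [0.2, 0.1],
                 [0.8, 0.2]])
bag_max = mil_pool(inst, "max")
bag_mean = mil_pool(inst, "mean")
```

Max pooling recovers a brief event that only one instance contains, while mean pooling under-scores it; attention pooling sits between the two by learning the weighting.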
“…The embedding-level approaches add the pooling layer to the high-level representation generated by neural networks to obtain a contextual representation, and the clip-level probability is further acquired [15]. As mentioned in [14] and [17], the embedding-level approaches are preferable in terms of the clip-level performance, which is demonstrated on Audioset [18], [19].…”
Section: Introduction
Mentioning confidence: 99%