2018
DOI: 10.48550/arxiv.1803.02353
Preprint

Multi-level Attention Model for Weakly Supervised Audio Classification

Cited by 24 publications (23 citation statements)
References 12 publications
“…These include TUT Acoustic scenes [17], the domestic sounds dataset CHIME-Home [18], the environmental sound classification dataset ESC-50 [19], FSDKaggle [20], and AudioSet [10]. Of relevance to our approach, a number of prior works have employed deep learning for audio comprehension [21,22,23,24,25] on the AudioSet dataset. Our work differs from theirs in that we focus instead on the task of retrieval with natural language queries, rather than audio recognition.…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…However, in practice, frame-level labeling is required. The following work [2,14,15,16] has addressed the issue. The TALNet [2] is one of the state-of-the-art efforts for AED with weakly labeled audio inputs, which has demonstrated strong performance for acoustic event tagging and localization at the same time.…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…One goal of the many proposed attention mechanisms is to discover the best approach for identifying the salient regions or features. Yu et al [14] proposed a multi-level attention model for weakly labelled audio classification that applies temporal attention to single-channel embedded feature maps. Li et al [15] proposed a multi-stream network with temporal attention in which the structure is composed of three streams, each containing a single temporal attention vector.…”
Section: Introduction; citation type: mentioning
confidence: 99%