Multi-level Attention Model for Weakly Supervised Audio Classification

Yu, Changsong; Barsim, Karim Said; Kong, Qiuqiang; Yang, Bin

doi:10.48550/arxiv.1803.02353

Cited by 24 publications

(23 citation statements)

References 12 publications

“…These include TUT Acoustic scenes [17], the domestic sounds dataset CHIME-Home [18], the environmental sound classification dataset ESC-50 [19], FSDKaggle [20], and AudioSet [10]. Of relevance to our approach, a number of prior works have employed deep learning for audio comprehension [21,22,23,24,25] on the AudioSet dataset. Our work differs from theirs in that we focus instead on the task of retrieval with natural language queries, rather than audio recognition.…”

Section: Related Work (mentioning)

confidence: 99%

Audio Retrieval with Natural Language Queries

Oncescu, Koepke, Henriques et al. 2021

Preprint

We consider the task of retrieving audio using free-form natural language queries. To study this problem, which has received limited attention in the existing literature, we introduce challenging new benchmarks for text-based audio retrieval using text annotations sourced from the AUDIOCAPS and CLOTHO datasets. We then employ these benchmarks to establish baselines for cross-modal audio retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into cross-modal text-based audio retrieval with free-form text queries.

“…However, in practice, frame-level labeling is required. The following work [2,14,15,16] has addressed the issue. The TALNet [2] is one of the state-of-the-art efforts for AED with weakly labeled audio inputs, which has demonstrated strong performance for acoustic event tagging and localization at the same time.…”

Section: Related Work (mentioning)

confidence: 99%

Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study

Liang, Shi, Wang et al. 2021

Preprint

Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life. Prior studies have shown that leveraging knowledge from a relevant domain is beneficial for a target acoustic event detection (AED) process. Inspired by the observation that many human-centered acoustic events in daily life involve voice elements, this paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an AED pipeline. Towards this end, we develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process and conduct thorough empirical studies to examine the performance on the public AudioSet [1] with different types of inputs. Our main observations are that: 1) Joint learning of audio and voice inputs improves the AED performance (mean average precision) for both a CNN baseline (0.292 vs 0.134 mAP) and a TALNet [2] baseline (0.361 vs 0.351 mAP); 2) Augmenting the extra voice features is critical to maximize the model performance with dual inputs.

“…One goal of the many proposed attention mechanisms is to discover the best approach for identifying the salient regions or features. Yu et al [14] proposed a multi-level attention model for weakly labelled audio classification that applies temporal attention to single-channel embedded feature maps. Li et al [15] proposed a multi-stream network with temporal attention in which the structure is composed of three streams, each containing a single temporal attention vector.…”

Section: Introduction (mentioning)

confidence: 99%

A Multi-Channel Temporal Attention Convolutional Neural Network Model for Environmental Sound Classification

Wang, Feng, Anderson 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Recently, many attention-based deep neural networks have emerged and achieved state-of-the-art performance in environmental sound classification. The essence of attention mechanism is assigning contribution weights on different parts of features, namely channels, spectral or spatial contents, and temporal frames. In this paper, we propose an effective convolutional neural network structure with a multichannel temporal attention (MCTA) block, which applies a temporal attention mechanism within each channel of the embedded features to extract channel-wise relevant temporal information. This multi-channel temporal attention structure will result in a distinct attention vector for each channel, which enables the network to fully exploit the relevant temporal information in different channels. The datasets used to test our model include ESC-50 and its subset ESC-10, along with development sets of DCASE 2018 and 2019. In our experiments, MCTA performed better than the single-channel temporal attention model and the non-attention model with the same number of parameters. Furthermore, we compared our model with some successful attention-based models and obtained competitive results with a relatively lighter network.
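The contrast the abstract draws between a single temporal attention vector and a distinct vector per channel can be illustrated with a minimal sketch. This is not the implementation from Wang et al. or from Yu et al.: the function names, the toy feature values, and the hand-picked attention scores are all assumptions made for illustration. In a trained network the scores would be produced by learned layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, scores):
    """Weighted sum of per-frame values using softmax-normalised scores."""
    weights = softmax(scores)
    return sum(w * x for w, x in zip(weights, frames))

def shared_temporal_attention(features, scores):
    """One attention vector over time, reused for every channel."""
    return [attention_pool(channel, scores) for channel in features]

def multi_channel_temporal_attention(features, scores_per_channel):
    """A distinct attention vector over time for each channel."""
    return [attention_pool(channel, scores)
            for channel, scores in zip(features, scores_per_channel)]

# Toy input (made-up values): 2 channels, 3 time frames each.
features = [[1.0, 2.0, 3.0], [4.0, 0.0, 2.0]]

# Hypothetical scores: uniform scores reduce attention pooling to a mean.
pooled_shared = shared_temporal_attention(features, [0.0, 0.0, 0.0])

# Hypothetical per-channel scores: channel 2 focuses on its first frame.
pooled_multi = multi_channel_temporal_attention(
    features, [[0.0, 0.0, 0.0], [10.0, -10.0, -10.0]]
)
```

With per-channel scores, the second channel can attend almost entirely to its first frame while the first channel still averages over time, which a single shared vector cannot express.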
