ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053150

A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification

Abstract: Acoustic event classification (AEC) and acoustic event detection (AED) refer to the task of detecting whether specific target events occur in audio. As long short-term memory (LSTM) leads to state-of-the-art results in various speech-related tasks, it has become a popular solution for AEC as well. This paper focuses on investigating the dynamics of LSTM models on AEC tasks. It includes a detailed analysis of LSTM memory retention, and a benchmarking of nine different pooling methods on LSTM models using 1.7…
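To make the abstract's notion of "pooling methods on LSTM models" concrete, the sketch below shows three common ways of collapsing frame-level LSTM outputs into a single clip-level vector (last frame, average, max). This is an illustrative PyTorch example with assumed feature and layer sizes; it is not the paper's exact configuration, nor its full set of nine pooling methods.

```python
# Illustrative sketch of pooling over LSTM frame outputs for clip-level AEC.
# Feature dimension, hidden size, and class count are assumptions.
import torch
import torch.nn as nn

feat_dim, hidden_dim, num_classes = 64, 128, 2
lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, num_classes)

x = torch.randn(8, 100, feat_dim)    # (batch, frames, features)
y, _ = lstm(x)                       # (batch, frames, hidden_dim)

z_last = y[:, -1, :]                 # last-frame pooling
z_mean = y.mean(dim=1)               # average pooling over frames
z_max, _ = y.max(dim=1)              # max pooling over frames

logits = classifier(z_mean)          # clip-level prediction from one pooling choice
```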

Cited by 24 publications (26 citation statements)
References 23 publications
“…For this reason, the LSTM sequence Y is usually summarized into a single value Z, which can be seen as a segment-level representation, before being fed to the classifier itself by means of a certain pooling mechanism [29,42]. In this work, we adopted the attention pooling strategy, as it has been successfully used in other classification problems involving the modeling of temporal sequences [26-34]. The hypothesis behind this approach was that certain LSTM frames contained more cues about the task under consideration than others.…”
Section: Single-Modal Attention LSTM-Based ADD System
Citation type: mentioning (confidence: 99%)
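The attention pooling described in the excerpt above amounts to: score each LSTM frame, softmax-normalize the scores into weights, and take the weighted sum as the segment-level representation Z. The PyTorch sketch below is a minimal illustration under assumed dimensions, not the cited systems' exact implementation.

```python
# Minimal attention-pooling sketch: Z = sum_t alpha_t * y_t, with alpha from a softmax.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)          # frame-level relevance score

    def forward(self, y):                              # y: (batch, frames, hidden_dim)
        alpha = torch.softmax(self.score(y), dim=1)    # (batch, frames, 1), sums to 1 over frames
        z = (alpha * y).sum(dim=1)                     # segment-level representation Z
        return z, alpha

hidden_dim = 128
lstm = nn.LSTM(input_size=40, hidden_size=hidden_dim, batch_first=True)
pool = AttentionPooling(hidden_dim)

y, _ = lstm(torch.randn(4, 200, 40))                   # assumed 40-dim frame features
z, alpha = pool(y)                                     # z: (4, 128), fed to the classifier
```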
“…Moreover, the incorporation of an attention mechanism in the LSTM framework generally improves the performance of these techniques, as it tries to learn the structure of the temporal sequences by modeling the particular relevance of each frame to the task under consideration [25]. Recently, these models have been successfully utilized for, among others, acoustic event detection [26], acoustic scene classification [27], automatic speech recognition [28], speech emotion recognition [29,30], cognitive load classification from speech [31,32], and speech intelligibility level classification [33,34]. To the best of our knowledge, the use of attentional LSTMs has not been previously explored in the literature for ADD systems based on either eye-tracker signals or speech signals.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…In contrast, non-relevant frames should be diminished or even ignored, so the values of the corresponding weights should be small. This approach has been applied with great success in other machine learning problems that deal with temporal sequences [14,16,17,19-21,25,45], including our previous work on the estimation of the intelligibility level [8,11].…”
Section: Attention Pooling
Citation type: mentioning (confidence: 99%)
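The claim that non-relevant frames receive small weights follows directly from the softmax normalization used in attention pooling: the weights are positive, sum to one, and frames with low relevance scores are pushed toward zero. A tiny illustrative snippet with made-up scores:

```python
# Softmax turns per-frame relevance scores into weights that sum to 1,
# so frames with low scores contribute almost nothing to the pooled vector.
import torch

scores = torch.tensor([4.0, 0.1, 0.2, 3.5, 0.0])   # hypothetical frame scores
weights = torch.softmax(scores, dim=0)
print(weights)   # approximately [0.60, 0.01, 0.01, 0.36, 0.01]
```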
“…More recently, deep learning (DL) methods have been proposed for SIC, as they have proven to be very effective in several audio- and speech-related tasks, such as acoustic event detection [14], automatic speech recognition [15], speech emotion recognition [16-18], cognitive load classification from speech [19,20], and deception detection from speech [21]. Recent studies propose the use of dense networks fed by features derived from the decomposition of log-mel spectrograms into temporal and frequency basis vectors [22], the use of convolutional neural networks with different spectro-temporal representations as input [23], or long short-term memory (LSTM) networks with MFCCs as feature vectors [24] for multilevel or binary speech intelligibility classification.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
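As a rough illustration of the LSTM-with-MFCC pipeline mentioned in the excerpt, the sketch below extracts MFCCs with librosa and feeds them to a small LSTM followed by a binary classifier. The file name, the 13-coefficient MFCC setup, and the layer sizes are assumptions for illustration, not the cited papers' settings.

```python
# Hedged sketch: MFCC features -> LSTM -> average pooling -> binary classifier.
import librosa
import torch
import torch.nn as nn

signal, sr = librosa.load("utterance.wav", sr=16000)                 # hypothetical input file
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)              # (13, frames)
features = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)    # (1, frames, 13)

lstm = nn.LSTM(input_size=13, hidden_size=64, batch_first=True)
classifier = nn.Linear(64, 2)                                        # binary: intelligible / not

y, _ = lstm(features)
logits = classifier(y.mean(dim=1))                                   # average pooling over frames
```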
“…There are two main MIL strategies, i.e., the instance-level approach [9] and the embedding-level approach [10,11]. The embedding-level approach integrates the instance-level feature representations into a bag-level contextual representation and then directly carries out bag-level classification, which shows better performance [12].…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
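The distinction between the two MIL strategies in the excerpt can be sketched in a few lines: the instance-level approach classifies each instance and aggregates the predictions, whereas the embedding-level approach first pools instance features into a bag-level representation and classifies that once. The dimensions and the mean aggregation below are illustrative assumptions.

```python
# Contrast of instance-level vs. embedding-level multiple-instance learning (MIL).
import torch
import torch.nn as nn

emb_dim, num_classes = 128, 2
bag = torch.randn(20, emb_dim)               # one bag of 20 instance embeddings

# Instance-level: classify each instance, then aggregate the predictions.
inst_clf = nn.Linear(emb_dim, num_classes)
bag_logits_inst = inst_clf(bag).mean(dim=0)  # mean aggregation of instance scores

# Embedding-level: pool instances into a bag embedding, then classify once.
bag_clf = nn.Linear(emb_dim, num_classes)
bag_embedding = bag.mean(dim=0)              # mean pooling of instance features
bag_logits_emb = bag_clf(bag_embedding)
```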