A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification

Kao, Chieh-Chi; Sun, Ming; Wang, Weiran; Wang, Chao

doi:10.1109/icassp40776.2020.9053150

Cited by 24 publications

(26 citation statements)

References 23 publications


“…For this reason, usually the LSTM sequence Y is summarized in a single value Z, which can be seen as a segment level representation, prior to be fed to the classifier itself by means of a certain pooling mechanism [29,42]. In this work, we adopted the attention pooling strategy as it has been successfully used in other classification problems involving the modeling of temporal sequences [26–34]. The hypothesis behind this approach was that certain LSTM frames contained more cues about the task under consideration than other ones.…”

Section: Single-modal Attention LSTM-based ADD System (mentioning)

confidence: 99%
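
The pooling step described in this statement can be illustrated with a minimal sketch. It assumes a dot-product scoring function with a single learned vector `w`, which the quoted passage does not specify; the names `attention_pool`, `mean_pool`, and `w` are hypothetical and chosen only for illustration. Each LSTM frame receives a softmax-normalized weight, and the segment-level representation Z is the weighted sum of the frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Summarize T frame vectors into one vector using attention weights.

    frames: list of T LSTM output vectors (each a list of D floats).
    w: hypothetical learned scoring vector of D floats (an assumption;
       the scoring function is not given in the quoted text).
    Returns (z, alphas): the pooled vector and the per-frame weights.
    """
    scores = [sum(wi * yi for wi, yi in zip(w, y)) for y in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    z = [sum(a * y[d] for a, y in zip(alphas, frames)) for d in range(dim)]
    return z, alphas


def mean_pool(frames):
    """Baseline: every frame contributes equally."""
    dim = len(frames[0])
    return [sum(y[d] for y in frames) / len(frames) for d in range(dim)]
```

With an all-zero `w`, every frame gets the same weight and attention pooling reduces to mean pooling, which makes the two strategies easy to compare on the same input.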

“…Moreover, the incorporation of an attention mechanism in the LSTM framework generally improves the performance of these techniques, as it tries to learn the structure of the temporal sequences by modeling the particular relevance of each frame to the task under consideration [25]. Recently, these models have been successfully utilized for, among others, acoustic event detection [26], acoustic scene classification [27], automatic speech recognition [28], speech emotion recognition [29,30], cognitive load classification from speech [31,32] or speech intelligibility level classification [33,34]. To the best of our knowledge, the use of attentional LSTMs has not been previously explored in the literature neither for ADD systems based on eye-tracker signals nor based on speech signals.…”

Section: Introduction (mentioning)

confidence: 99%


Detecting Deception from Gaze and Speech Using a Multimodal Attention LSTM-Based Framework

Gallardo-Antolín

Montero

2021

Applied Sciences


The automatic detection of deceptive behaviors has recently attracted the attention of the research community due to the variety of areas where it can play a crucial role, such as security or criminology. This work is focused on the development of an automatic deception detection system based on gaze and speech features. The first contribution of our research on this topic is the use of attention Long Short-Term Memory (LSTM) networks for single-modal systems with frame-level features as input. In the second contribution, we propose a multimodal system that combines the gaze and speech modalities into the LSTM architecture using two different combination strategies: Late Fusion and Attention-Pooling Fusion. The proposed models are evaluated over the Bag-of-Lies dataset, a multimodal database recorded in real conditions. On the one hand, results show that attentional LSTM networks are able to adequately model the gaze and speech feature sequences, outperforming a reference Support Vector Machine (SVM)-based system with compact features. On the other hand, both combination strategies produce better results than the single-modal systems and the multimodal reference system, suggesting that gaze and speech modalities carry complementary information for the task of deception detection that can be effectively exploited by using LSTMs.


“…In contrast, non-relevant frames should be diminished or even ignored, so the values of the corresponding weights should be small. This approach has been proposed with great success in other automatic learning problems that deal with temporal sequences [14,16,17,19–21,25,45], including our previous works on the estimation of the intelligibility level [8,11].…”

Section: Attention Pooling (mentioning)

confidence: 99%

“…More recently, deep learning (DL) methods have been proposed for SIC as they have been proven to be very effective in several audio and speech-related tasks, such as acoustic event detection [14], automatic speech recognition [15], speech emotion recognition [16–18], cognitive load classification from speech [19,20], or deception detection from speech [21]. Recent studies propose the use of dense networks fed by features derived from the decomposition of log-mel spectrograms in temporal and frequency basis vectors [22], the use of convolutional neural networks and different spectro-temporal representations as input [23], or long short-term memory (LSTM) networks with MFCC as feature vectors [24] for multilevel or binary speech intelligibility classification.…”

Section: Introduction (mentioning)

confidence: 99%

An Auditory Saliency Pooling-Based LSTM Model for Speech Intelligibility Classification

Gallardo-Antolín

Montero

2021

Symmetry


Speech intelligibility is a crucial element in oral communication that can be influenced by multiple elements, such as noise, channel characteristics, or speech disorders. In this paper, we address the task of speech intelligibility classification (SIC) in this last circumstance. Taking our previous works, a SIC system based on an attentional long short-term memory (LSTM) network, as a starting point, we deal with the problem of the inadequate learning of the attention weights due to training data scarcity. For overcoming this issue, the main contribution of this paper is a novel type of weighted pooling (WP) mechanism, called saliency pooling where the WP weights are not automatically learned during the training process of the network, but are obtained from an external source of information, the Kalinli’s auditory saliency model. In this way, it is intended to take advantage of the apparent symmetry between the human auditory attention mechanism and the attentional models integrated into deep learning networks. The developed systems are assessed on the UA-speech dataset that comprises speech uttered by subjects with several dysarthria levels. Results show that all the systems with saliency pooling significantly outperform a reference support vector machine (SVM)-based system and LSTM-based systems with mean pooling and attention pooling, suggesting that Kalinli’s saliency can be successfully incorporated into the LSTM architecture as an external cue for the estimation of the speech intelligibility level.
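
The weighted pooling idea in this abstract, where the weights come from an external source instead of being learned, might look roughly like the sketch below. It assumes the external saliency scores are non-negative and are normalized by their sum; the abstract does not state the normalization, and the function name `weighted_pool` is hypothetical. Computing the saliency scores themselves (Kalinli's model) is outside the scope of this sketch.

```python
def weighted_pool(frames, saliency):
    """Pool T frame vectors using externally supplied per-frame weights.

    frames: list of T LSTM output vectors (each a list of D floats).
    saliency: list of T non-negative scores from an external model.
              Normalizing by the sum is an assumption made here.
    Returns the pooled vector of D floats.
    """
    if len(frames) != len(saliency):
        raise ValueError("need exactly one saliency score per frame")
    total = sum(saliency)
    if total <= 0:
        raise ValueError("saliency scores must have a positive sum")
    weights = [s / total for s in saliency]
    dim = len(frames[0])
    return [sum(a * y[d] for a, y in zip(weights, frames)) for d in range(dim)]
```

Passing equal scores for every frame reproduces mean pooling, so the same function covers the mean-pooling baseline the abstract compares against.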


“…There are two main MIL strategies, i.e. instance-level approach [9] and embedding-level approach [10,11]. The embedding-level approach integrates the instance-level feature representations into a bag-level contextual representation and then directly carries out bag-level classification, which shows a better performance [12].…”

Section: Introduction (mentioning)

confidence: 99%
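
The difference between the two MIL strategies named in this statement can be shown with a small sketch. It assumes a logistic-regression classifier, max pooling for the instance-level approach, and mean pooling for the embedding-level approach; none of these choices come from the quoted text, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(vec, weights, bias):
    """Hypothetical linear classifier returning a probability."""
    return sigmoid(sum(w * v for w, v in zip(weights, vec)) + bias)


def instance_level(bag, weights, bias):
    """Classify each instance first, then pool the probabilities (max)."""
    return max(classify(x, weights, bias) for x in bag)


def embedding_level(bag, weights, bias):
    """Pool instance features into one bag embedding (mean), then classify once."""
    dim = len(bag[0])
    embedding = [sum(x[d] for x in bag) / len(bag) for d in range(dim)]
    return classify(embedding, weights, bias)
```

In an audio tagging setting, a bag would correspond to an audio clip and each instance to a frame or segment of that clip; the two functions differ only in whether pooling happens after or before the classifier.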

A Global-Local Attention Framework for Weakly Labelled Audio Tagging

Wang

Zou

Wang

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Weakly labelled audio tagging aims to predict the classes of sound events within an audio clip, where the onset and offset times of the sound events are not provided. Previous works have used the multiple instance learning (MIL) framework, and exploited the information of the whole audio clip by MIL pooling functions. However, the detailed information of sound events such as their durations may not be considered under this framework. To address this issue, we propose a novel two-stream framework for audio tagging by exploiting the global and local information of sound events. The global stream aims to analyze the whole audio clip in order to capture the local clips that need to be attended using a class-wise selection module. These clips are then fed to the local stream to exploit the detailed information for a better decision. Experimental results on the AudioSet show that our proposed method can significantly improve the performance of audio tagging under different baseline network architectures.

