Exploring Attention Mechanism for Acoustic-based Classification of Speech Utterances into System-directed and Non-system-directed

Norouzian, Atta; Mazoure, Bogdan; Connolly, Dermot; Willett, Daniel

doi:10.1109/icassp.2019.8683565

Cited by 22 publications

(37 citation statements)

References 14 publications


“…This paper continues our previous work in [1] where only acoustic features were used for utterance classification. Specifically, we present two methods that incorporate non-acoustic information into our models to improve upon our previous acoustic-only-based performance; the first incorporates ASR decoder features in addition to the usual acoustic features, while the second further adds word embeddings as inputs to the final classification stage of the model.…”

Section: Introduction (supporting)

confidence: 57%

“…Approaches for the classification of utterances into system- and non-system-directed ones typically use acoustic features extracted from the speech signal, e.g., [1,2,3,4,5]. Previous works [1,6,7] also show that using an attention mechanism combined with a BiLSTM network can improve classification performance.…”

(Footnote from the citing paper: *Author performed research herein as part of an internship-partnership program between Mila and Nuance.)

Section: Introduction (mentioning)

confidence: 99%

“…combined with a BiLSTM network can improve classification performance. Our previous work in [1] employs such a BiLSTM with attention, with acoustic features as inputs.…”

Section: Introduction (mentioning)

confidence: 99%


Improving Identification of System-Directed Speech Utterances by Deep Learning of ASR-Based Word Embeddings and Confidence Metrics

Vilaysouk; Nour-Eldin; Connolly (2021)

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


In this paper, we extend our previous work on the detection of system-directed speech utterances. This type of binary classification can be used by virtual assistants to create a more natural and fluid interaction between the system and the user. We explore two methods that both improve the Equal-Error-Rate (EER) performance of the previous model. The first exploits the supplementary information independently captured by ASR models through integrating ASR decoder-based features as additional inputs to the final classification stage of the model. This relatively improves EER performance by 13%. The second proposed method further integrates word embeddings into the architecture and, when combined with the first method, achieves a significant EER performance improvement of 48%, relative to that of the baseline.
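The relative EER improvements quoted in the abstract above (13% and 48%) follow the usual "reduction as a fraction of the baseline" convention. A minimal sketch of that arithmetic is given below; the absolute EER values are hypothetical, chosen only for illustration, since the abstract does not state them.

```python
def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction as a fraction of the baseline (lower EER is better)."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical EER values (in percent), NOT taken from the paper.
baseline = 10.0
with_decoder_features = 8.7
with_word_embeddings = 5.2

for name, eer in [("decoder features", with_decoder_features),
                  ("decoder features + word embeddings", with_word_embeddings)]:
    gain = relative_improvement(baseline, eer)
    print(f"{name}: {gain:.0%} relative EER improvement")
```

Note that a relative improvement says nothing about the absolute EER: the same 13% could describe a drop from 10.0% to 8.7% or from 2.0% to 1.74%.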


“…The CNNs or the BiLSTMs layers generate a sequence of vectors for the classification process [32]. The attention layer is used to convert the sequence of vectors (frames) into a context vector, which attends some parts of the input sequence [33, 34]. Figure 2 illustrates the role of the attention layer in our approach.…”

Section: The Proposed Framework (mentioning)

confidence: 99%

Agent Productivity Modeling in a Call Center Domain Using Attentive Convolutional Neural Networks

Ahmed; Toral; Shaalan; et al. (2020)

Sensors


Measuring the productivity of an agent in a call center domain is a challenging task. Subjective measures are commonly used for evaluation in the current systems. In this paper, we propose an objective framework for modeling agent productivity for real estate call centers based on speech signal processing. The problem is formulated as a binary classification task using deep learning methods. We explore several designs for the classifier based on convolutional neural networks (CNNs), long-short-term memory networks (LSTMs), and an attention layer. The corpus consists of seven hours collected and annotated from three different call centers. The result shows that the speech-based approach can lead to significant improvements (1.57% absolute improvements) over a robust text baseline system.


“…The sequence of vectors (frames) produced from CNN or LSTM and forwarded to the attention layer to convert them into a context vector [23, 28, 29]. The attention weight are forwarded to Softmax function at time t to generate the probability of the frame out of one to the remaining frames in the same speech segment.…”

Section: The Proposed Framework (mentioning)

confidence: 99%
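The citation statements above describe an attention layer that scores each frame, normalizes the scores with a softmax, and collapses the frame sequence into a single context vector. A minimal sketch of that idea follows. It assumes dot-product scoring against a hypothetical learned `query` vector; the cited papers do not specify their scoring function here, so this should not be read as their exact method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse a sequence of frame vectors into one context vector.

    frames: list of T vectors (each a list of D floats), e.g. CNN or BiLSTM outputs.
    query:  hypothetical learned scoring vector of length D (an assumption).
    """
    # One scalar relevance score per frame: dot product with the query vector.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    # Softmax turns the scores into positive weights that sum to 1.
    weights = softmax(scores)
    # Context vector: weighted sum of the frames, one entry per dimension.
    dim = len(frames[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, frames))
               for d in range(dim)]
    return context, weights


# Toy input: three 2-dimensional frames; the third scores highest.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, 0.5]
context, weights = attention_pool(frames, query)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

The context vector has a fixed size regardless of the number of frames, which is what lets a downstream classifier handle utterances of varying length.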

A Multimodal Approach to Improve Performance Evaluation of Call Center Agent

Ahmed; Shaalan; Toral; et al. (2021)

Sensors


The paper proposes three modeling techniques to improve the performance evaluation of the call center agent. The first technique is speech processing supported by an attention layer for the agent’s recorded calls. The speech comprises 65 features for the ultimate determination of the context of the call using the Open-Smile toolkit. The second technique uses the Max Weights Similarity (MWS) approach instead of the Softmax function in the attention layer to improve the classification accuracy. MWS function replaces the Softmax function for fine-tuning the output of the attention layer for processing text. It is formed by determining the similarity in the distance of input weights of the attention layer to the weights of the max vectors. The third technique combines the agent’s recorded call speech with the corresponding transcribed text for binary classification. The speech modeling and text modeling are based on combinations of the Convolutional Neural Networks (CNNs) and Bi-directional Long-Short Term Memory (BiLSTMs). In this paper, the classification results for each model (text versus speech) are proposed and compared with the multimodal approach’s results. The multimodal classification provided an improvement of (0.22%) compared with acoustic model and (1.7%) compared with text model.
