Automatic speech emotion recognition has been a research hotspot in human-computer interaction over the past decade. However, because the inherent temporal structure of the speech waveform has been under-exploited, current recognition accuracy leaves room for improvement. To make full use of the differences in emotional saturation across time frames, a novel method is proposed for speech emotion recognition that combines frame-level speech features with attention-based long short-term memory (LSTM) recurrent neural networks. Frame-level speech features are extracted from the waveform to replace traditional statistical features, preserving the temporal relations of the original speech through the sequence of frames. To distinguish emotional saturation across frames, two attention-based improvements to the LSTM are proposed: first, the forget gate of the traditional LSTM is modified to reduce computational complexity without sacrificing performance; second, at the LSTM output, an attention mechanism is applied to both the time and feature dimensions to extract task-relevant information, rather than using only the output of the last time step as in the traditional algorithm. Extensive experiments on the CASIA, eNTERFACE, and GEMEP emotion corpora demonstrate that the proposed approach outperforms state-of-the-art algorithms reported to date.
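A minimal PyTorch sketch of the attention pooling over both the time and feature dimensions of LSTM outputs described above. The layer names, shapes, and the use of a standard LSTM (the paper's forget-gate modification is omitted) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class AttentiveLSTM(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.time_attn = nn.Linear(hidden_dim, 1)           # scores each frame
        self.feat_attn = nn.Linear(hidden_dim, hidden_dim)  # scores each feature

        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                  # x: (batch, T, feat_dim)
        h, _ = self.lstm(x)                # h: (batch, T, hidden_dim)
        # Temporal attention: weight frames by their relevance (e.g. emotional
        # saturation) instead of keeping only the last time step.
        a_t = torch.softmax(self.time_attn(h), dim=1)   # (batch, T, 1)
        ctx = (a_t * h).sum(dim=1)         # (batch, hidden_dim)
        # Feature attention: re-weight the pooled feature dimensions.
        a_f = torch.softmax(self.feat_attn(ctx), dim=-1)
        return self.classifier(a_f * ctx)  # class logits


model = AttentiveLSTM(feat_dim=40, hidden_dim=128, num_classes=6)
logits = model(torch.randn(8, 300, 40))    # 8 utterances, 300 frames each
```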
Despite the widespread use of deep learning for speech emotion recognition, deep neural networks are severely restricted by information loss in their higher layers, as well as by the degradation problem. To utilize information efficiently and counter degradation, an attention-based dense long short-term memory (LSTM) network is proposed for speech emotion recognition. LSTM networks, suited to processing time series such as speech, are constructed, and attention-based dense connections are introduced into them: weight coefficients are added to the skip-connections between layers to distinguish differences in emotional information across layers and to keep redundant information from the lower layers from interfering with the effective information in the upper layers. Experiments demonstrate that the proposed method improves recognition performance by 12% and 7% on the eNTERFACE and IEMOCAP corpora, respectively.
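A minimal sketch of attention-weighted dense connections between stacked LSTM layers: each layer receives a learned, softmax-normalized mixture of all lower-layer outputs, so redundant lower-layer information can be down-weighted. The names, dimensions, and mean-pooling readout are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class AttentiveDenseLSTM(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_layers, num_classes):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)  # level-0 representation
        self.layers = nn.ModuleList(
            [nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
             for _ in range(num_layers)]
        )
        # One learnable attention score per skip-connection into layer l.
        self.skip_scores = nn.ParameterList(
            [nn.Parameter(torch.zeros(l + 1)) for l in range(num_layers)]
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                        # x: (batch, T, feat_dim)
        outputs = [self.proj(x)]                 # level 0: projected input
        for lstm, scores in zip(self.layers, self.skip_scores):
            w = torch.softmax(scores, dim=0)     # weight each lower layer
            mixed = sum(w_i * o for w_i, o in zip(w, outputs))
            h, _ = lstm(mixed)
            outputs.append(h)
        # Utterance-level logits from the top layer (mean over frames).
        return self.classifier(outputs[-1].mean(dim=1))


model = AttentiveDenseLSTM(feat_dim=40, hidden_dim=128,
                           num_layers=3, num_classes=4)
logits = model(torch.randn(8, 300, 40))
```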
Because traditional single-channel speech enhancement algorithms are sensitive to the acoustic environment and perform poorly, a speech enhancement algorithm based on an attention-gated long short-term memory (LSTM) network is proposed. To simulate the perceptual characteristics of human hearing, the algorithm divides the frequency axis into bands according to the Bark scale. From these bands, Bark-frequency cepstral coefficients (BFCCs), their derivative features, and pitch-based features are extracted. Furthermore, since different noises affect clean speech differently, an attention mechanism is applied to select the information least polluted by noise, which helps reconstruct the clean speech. To adaptively reallocate the power ratio between speech and noise when constructing the ratio mask, the ideal ratio mask (IRM) weighted by the inter-channel correlation (ICC) is adopted as the learning target. In addition, to improve network performance, the algorithm introduces a multi-objective learning strategy that jointly optimizes the network with a voice activity detector (VAD). Subjective and objective experiments show that the proposed algorithm outperforms the baseline algorithms. In real-time experiments, the proposed algorithm maintains high real-time performance and fast convergence.
Index Terms: speech enhancement, long short-term memory, attention mechanism, Bark scale.
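A minimal sketch of the multi-objective setup: a shared LSTM trunk predicts a per-band ratio mask and a frame-level VAD probability, trained with a joint loss. The plain IRM is shown as the target; the paper's ICC weighting and attention gate are paper-specific and not reproduced here, and all names, dimensions, and the loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn


def ideal_ratio_mask(speech_pow, noise_pow):
    # Plain IRM: sqrt(S / (S + N)) per time-frequency (or Bark-band) unit.
    return torch.sqrt(speech_pow / (speech_pow + noise_pow + 1e-8))


class MaskVADNet(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_bands):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.mask_head = nn.Linear(hidden_dim, num_bands)  # ratio mask
        self.vad_head = nn.Linear(hidden_dim, 1)           # speech presence

    def forward(self, feats):                  # feats: (batch, T, feat_dim)
        h, _ = self.lstm(feats)
        mask = torch.sigmoid(self.mask_head(h))   # in [0, 1] per band
        vad = torch.sigmoid(self.vad_head(h))     # per-frame probability
        return mask, vad


model = MaskVADNet(feat_dim=42, hidden_dim=256, num_bands=22)
feats = torch.randn(4, 200, 42)                # e.g. BFCCs + deltas + pitch
speech_pow = torch.rand(4, 200, 22)            # synthetic band powers
noise_pow = torch.rand(4, 200, 22)

target_mask = ideal_ratio_mask(speech_pow, noise_pow)
# Crude frame-level VAD label: speech power dominates noise power.
vad_target = (speech_pow.sum(-1, keepdim=True)
              > noise_pow.sum(-1, keepdim=True)).float()

mask, vad = model(feats)
# Joint objective: mask regression plus a weighted VAD term.
loss = (nn.functional.mse_loss(mask, target_mask)
        + 0.3 * nn.functional.binary_cross_entropy(vad, vad_target))
```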