Speech Enhancement Using Deep Learning Methods: A Review

Yuliani, Asri Rizki; Amri, M. Faizal; Suryawati, Endang; Ramdan, Ade; Pardede, Hilman F.

doi:10.14203/jet.v21.19-26

Cited by 25 publications

(7 citation statements)

References 43 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Generally, other evaluations of systems that perform speech separation or speech enhancements have been performed using simulated databases [72], which makes it more difficult to assess the extent to which these approaches can be applied in the real world. Moreover, it is even more difficult to find studies that assess the performance of the models under real dynamic conditions such as in mobile HRI scenarios.…”

Section: Human-robot Interactionmentioning

confidence: 99%

Automatic Detection of Dyspnea in Real Human–Robot Interaction Scenarios

Alvarado,

Grágeda,

Luzanto

et al. 2023

Sensors

View full text Add to dashboard Cite

A respiratory distress estimation technique for telephony previously proposed by the authors is adapted and evaluated in real static and dynamic HRI scenarios. The system is evaluated with a telephone dataset re-recorded using the robotic platform designed and implemented for this study. In addition, the original telephone training data are modified using an environmental model that incorporates natural robot-generated and external noise sources and reverberant effects using room impulse responses (RIRs). The results indicate that the average accuracy and AUC are just 0.4% less than those obtained with matched training/testing conditions with simulated data. Quite surprisingly, there is not much difference in accuracy and AUC between static and dynamic HRI conditions. Moreover, the beamforming methods delay-and-sum and MVDR lead to average improvement in accuracy and AUC equal to 8% and 2%, respectively, when applied to training and testing data. Regarding the complementarity of time-dependent and time-independent features, the combination of both types of classifiers provides the best joint accuracy and AUC score.

show abstract

Section: Human-robot Interactionmentioning

confidence: 99%

Automatic Detection of Dyspnea in Real Human–Robot Interaction Scenarios

Alvarado,

Grágeda,

Luzanto

et al. 2023

Sensors

View full text Add to dashboard Cite

show abstract

“…DNNs have shown better performance at modeling complex nonlinearity and perform better on SE in severe noisy backgrounds that are highly non-stationary. SE has made large progress due to the recent growth of DNNs [15]. The study [16] compressed the models to reduce DNN size, regressionbased DNNs [17] are proposed for SE, DNN-based SE is proposed to examine performance across different speech datasets [18], time-frequency-based training objective is proposed for SE [19], and generalized gaussian distributions are VOLUME 4, 2016 proposed for SE using regression-based DNN [20].…”

Section: Introductionmentioning

confidence: 99%

Spatio-Temporal Features Representation Using Recurrent Capsules for Monaural Speech Enhancement

Ali,

Saleem,

Bourouis

et al. 2024

IEEE Access

View full text Add to dashboard Cite

Single-channel speech enhancement is important for modern communication systems and has received a lot of attention. A convolutional neural network (CNN) successfully learns feature representations from speech spectrograms but loses spatial information due to distortion, which is important for humans to understand speech. Speech feature learning is an important ongoing research to capture higher-level representations of speech that go beyond conventional techniques. By considering the hierarchical structure and temporal relationships within speech signals, capsule networks (CapsNets) have the potential to provide more expressive and context-aware feature representations. By considering the advantages of CapNets over CNN, this study presents a model for monaural speech enhancement that keeps spatial information in a capsule and uses dynamic routing to pass it to higher layers. Dynamic routing replaces the pooling recurrent hidden states to get speech features from the outputs of the capsule. Leveraging long-term contexts provides identification of the target speaker. Therefore, a gated recurrent layer, gated recurrent unit (GRU), or long-short-term memory (LSTM), is placed above the CNN module and next to the capsule module in the architecture. This makes it viable to extract spatial features and long-term temporal dynamics. The suggested convolutional recurrent CapNet performs better compared to the models based on CNNs and recurrent neural networks. The suggested speech enhancement produces considerably better speech quality and intelligibility. With the LibriSpeech and VoiceBank+DEMAND databases, the suggested speech enhancement improves the intelligibility and quality by 18.33% and (0.94) 36.82% over the noisy mixtures.

show abstract

“…Deep learning was defined by [5] as models composed of multiple processing layers that learn representations of data with various levels of abstraction. These models have shown an exemplary performance in speech enhancement [6].…”

Section: Introductionmentioning

confidence: 99%

Analyzing the Influence of Diverse Background Noises on Voice Transmission: A Deep Learning Approach to Noise Suppression

Nogales,

Caracuel-Cayuela,

García-Tejedor

2024

Applied Sciences

View full text Add to dashboard Cite

This paper presents an approach to enhancing the clarity and intelligibility of speech in digital communications compromised by various background noises. Utilizing deep learning techniques, specifically a Variational Autoencoder (VAE) with 2D convolutional filters, we aim to suppress background noise in audio signals. Our method focuses on four simulated environmental noise scenarios: storms, wind, traffic, and aircraft. The training dataset has been obtained from public sources (TED-LIUM 3 dataset, which includes audio recordings from the popular TED-TALK series) combined with these background noises. The audio signals were transformed into 2D power spectrograms, upon which our VAE model was trained to filter out the noise and reconstruct clean audio. Our results demonstrate that the model outperforms existing state-of-the-art solutions in noise suppression. Although differences in noise types were observed, it was challenging to definitively conclude which background noise most adversely affects speech quality. The results have been assessed with objective (mathematical metrics) and subjective (listening to a set of audios by humans) methods. Notably, wind noise showed the smallest deviation between the noisy and cleaned audio, perceived subjectively as the most improved scenario. Future work should involve refining the phase calculation of the cleaned audio and creating a more balanced dataset to minimize differences in audio quality across scenarios. Additionally, practical applications of the model in real-time streaming audio are envisaged. This research contributes significantly to the field of audio signal processing by offering a deep learning solution tailored to various noise conditions, enhancing digital communication quality.

show abstract

Speech Enhancement Using Deep Learning Methods: A Review

Cited by 25 publications

References 43 publications

Automatic Detection of Dyspnea in Real Human–Robot Interaction Scenarios

Automatic Detection of Dyspnea in Real Human–Robot Interaction Scenarios

Spatio-Temporal Features Representation Using Recurrent Capsules for Monaural Speech Enhancement

Analyzing the Influence of Diverse Background Noises on Voice Transmission: A Deep Learning Approach to Noise Suppression

Contact Info

Product

Resources

About