Traditional stereophonic acoustic echo cancellation algorithms need to estimate the acoustic echo paths from the stereo loudspeakers to a microphone, and this estimation often suffers from the nonuniqueness problem caused by the high correlation between the two far-end signals driving the loudspeakers. Many decorrelation methods have been proposed to mitigate this problem, but they may degrade the audio quality and/or the stereophonic spatial perception. This paper proposes a convolutional recurrent network (CRN) that suppresses the stereophonic echo components by estimating a nonlinear gain, which is then multiplied by the complex spectrum of the microphone signal to obtain the estimated near-end speech, without any decorrelation procedure. The CRN comprises an encoder-decoder module and a two-layer gated recurrent unit (GRU) module, so it simultaneously exploits the feature extraction capability of convolutional neural networks and the temporal modeling capability of recurrent neural networks. The magnitude spectra of the two far-end signals are used directly as input features, without any decorrelation preprocessing, so both the audio quality and the stereophonic spatial perception are maintained. Experimental results in both simulated and real acoustic environments show that the proposed algorithm outperforms traditional algorithms such as the normalized least-mean-square (NLMS) and Wiener algorithms, especially at low signal-to-echo ratios and long reverberation times (RT60).
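To make the described architecture concrete, below is a minimal PyTorch sketch of a CRN-style gain estimator. The channel counts, kernel sizes, and the 3-channel input layout (microphone magnitude spectrum stacked with the two far-end magnitude spectra) are illustrative assumptions; the abstract only specifies an encoder-decoder module plus a two-layer GRU producing a gain for the microphone spectrum.

```python
# Minimal sketch of a CRN-style gain estimator for stereophonic AEC.
# Layer sizes and the input layout are assumptions, not the paper's settings.
import torch
import torch.nn as nn

class CRNGainEstimator(nn.Module):
    def __init__(self, freq_bins=161):
        super().__init__()
        # Encoder: 2-D convolutions that downsample along frequency only.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 2), padding=(1, 1)),
            nn.ELU(),
            nn.Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 2), padding=(1, 1)),
            nn.ELU(),
        )
        f = freq_bins
        for _ in range(2):                      # frequency size after encoder
            f = (f + 2 * 1 - 3) // 2 + 1
        # Two-layer GRU models temporal dependencies on the flattened features.
        self.gru = nn.GRU(32 * f, 32 * f, num_layers=2, batch_first=True)
        # Decoder mirrors the encoder; the sigmoid bounds the gain to [0, 1].
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=(3, 3), stride=(1, 2), padding=(1, 1)),
            nn.ELU(),
            nn.ConvTranspose2d(16, 1, kernel_size=(3, 3), stride=(1, 2), padding=(1, 1)),
            nn.Sigmoid(),
        )

    def forward(self, mic_mag, far1_mag, far2_mag):
        # Each input: (batch, time, freq) magnitude spectrogram.
        x = torch.stack([mic_mag, far1_mag, far2_mag], dim=1)   # (B, 3, T, F)
        z = self.encoder(x)                                     # (B, 32, T, F')
        b, c, t, f = z.shape
        z, _ = self.gru(z.permute(0, 2, 1, 3).reshape(b, t, c * f))
        z = z.reshape(b, t, c, f).permute(0, 2, 1, 3)
        return self.decoder(z).squeeze(1)                       # (B, T, F) gain
```

The predicted gain would then be multiplied bin-wise with the complex microphone spectrum (i.e., applied equally to its real and imaginary parts) before the inverse STFT reconstructs the estimated near-end speech.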
This paper describes a three-stage acoustic echo cancellation (AEC) and suppression framework for the ICASSP 2021 AEC Challenge. In the first stage, a partitioned-block frequency-domain adaptive filter (PBFDAF) cancels the linear echo components without introducing near-end speech distortion, after the time delay between the far-end reference signal and the microphone signal has been compensated. In the second stage, a deep complex U-Net integrated with gated recurrent units is proposed to further suppress the residual echo components. In the last stage, an extremely compact deep complex U-Net is trained to suppress the non-speech residual components not completely removed by the second stage, which further increases the echo return loss enhancement (ERLE) without dramatically increasing the computational complexity. Experimental results show that the proposed three-stage framework achieves an ERLE higher than 50 dB in both single-talk and double-talk scenarios and improves the perceptual evaluation of speech quality (PESQ) score by about 0.75 in double-talk scenarios. The proposed framework outperforms the AEC Challenge baseline ResRNN by 0.12 points in terms of the mean opinion score (MOS).
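As an illustration of the first stage, below is a minimal NumPy sketch of a partitioned-block frequency-domain adaptive filter with overlap-save processing and an NLMS-style constrained update. The block size, partition count, step size, and PSD smoothing factor are illustrative assumptions, and the delay compensation and the two neural post-filtering stages are omitted.

```python
# Sketch of a partitioned-block frequency-domain adaptive filter (PBFDAF).
# Parameter values are assumptions, not the challenge entry's settings.
import numpy as np

def pbfdaf(far, mic, block=256, partitions=8, mu=0.5, eps=1e-6):
    """Return the error signal (echo-cancelled microphone signal)."""
    n_fft = 2 * block
    W = np.zeros((partitions, n_fft), dtype=complex)  # filter partitions
    X = np.zeros((partitions, n_fft), dtype=complex)  # far-end spectra FIFO
    psd = np.full(n_fft, eps)                         # smoothed far-end PSD
    x_old = np.zeros(block)
    n_blocks = min(len(far), len(mic)) // block
    err = np.zeros(n_blocks * block)

    for b in range(n_blocks):
        x_new = far[b * block:(b + 1) * block]
        X = np.roll(X, 1, axis=0)                     # shift the spectra FIFO
        X[0] = np.fft.fft(np.concatenate([x_old, x_new]))  # overlap-save input
        x_old = x_new

        # Echo estimate: sum of partition-wise frequency-domain products.
        y_hat = np.real(np.fft.ifft((W * X).sum(axis=0)))[block:]
        e = mic[b * block:(b + 1) * block] - y_hat
        err[b * block:(b + 1) * block] = e

        # NLMS-style update, normalized by the smoothed far-end PSD.
        E = np.fft.fft(np.concatenate([np.zeros(block), e]))
        psd = 0.9 * psd + 0.1 * np.abs(X[0]) ** 2
        for p in range(partitions):
            g = np.fft.ifft(np.conj(X[p]) * E / (psd + eps))
            g[block:] = 0                             # gradient constraint
            W[p] += mu * np.fft.fft(g)
    return err

# Toy usage: the echo is the far-end signal convolved with a decaying
# synthetic impulse response; with no near-end talker, err should decay.
rng = np.random.default_rng(0)
far = rng.standard_normal(32000)
rir = rng.standard_normal(1024) * np.exp(-np.arange(1024) / 128)
mic = np.convolve(far, rir)[:32000]
near_est = pbfdaf(far, mic)
```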