End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming

Zhang, Wangyou; Subramanian, Arvind; Chang, Xuankai; Watanabe, Shinji; Ye, Qian

doi:10.21437/interspeech.2020-2432

Cited by 29 publications

(18 citation statements)

References 36 publications


“…The ASR backend is a joint connectionist temporal classification (CTC) / attention-based encoder-decoder [13] model for recognizing the separated single-channel speech. Compared to those in our previous work [22], the proposed architecture can support different beamformer variants in a single framework, by using a single mask estimator for WPE / beamforming and applying single-source WPE for processing speech of different sources.…”

Section: PIT-based Loss (mentioning)

confidence: 99%

“…The numerical problem generally originates from the complex operations in the WPE and beamforming formulas, such as the complex matrix inverse, leading to poor performance in certain frequency bins sparsely populated. Such behaviors are particularly undesirable in the joint training with ASR, as they can easily result in not-a-number (NaN) gradients that fail to backpropagate correctly and even prevent the model from converging properly [22], thus badly impacting the overall model performance. In order to mitigate this problem, we propose four approaches to improve the stability of both WPE and beamforming submodules:…”

Section: Attacking the Numerical Instability Issue (mentioning)

confidence: 99%

“…(3) More stable complex matrix operations. Due to the lack of complex support in PyTorch, the alternative method in Section 4.3 in [35] was used in our previous work [22], which tries to find a factor to construct an invertible real matrix and maps the complex inversion to some real matrix operations. But it sometimes fails due to the poor estimate of the factor that results in a singular matrix.…”

Section: Attacking the Numerical Instability Issue (mentioning)

confidence: 99%
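
The idea of mapping a complex inversion onto real matrix operations, as described in the statement above, can be illustrated with a minimal NumPy sketch. Note that this is not the factor-based method from Section 4.3 of [35] that the quote refers to, nor the citing paper's implementation; it uses the standard real block embedding of a complex matrix instead. The function name `complex_inverse_via_real` is hypothetical.

```python
import numpy as np

def complex_inverse_via_real(C):
    """Invert a square complex matrix C = A + iB using only real arithmetic.

    Illustrative assumption: uses the real block embedding
        R = [[A, -B],
             [B,  A]],
    whose inverse has the form [[X, -Y], [Y, X]] with C^{-1} = X + iY.
    """
    n = C.shape[-1]
    A, B = C.real, C.imag
    # Build the 2n x 2n real matrix that represents C.
    R = np.block([[A, -B], [B, A]])
    # Real-valued inversion; raises LinAlgError if R (and hence C) is singular.
    R_inv = np.linalg.inv(R)
    X = R_inv[:n, :n]   # real part of C^{-1}
    Y = R_inv[n:, :n]   # imaginary part of C^{-1}
    return X + 1j * Y
```

The embedding is invertible exactly when the complex matrix is, so it avoids having to estimate a separate factor, at the cost of inverting a matrix of twice the size.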

“…With the above proposed techniques, we are now able to optimize the convolutional beamformer and ASR jointly, without the need of pretraining as in [22].…”

Section: Attacking the Numerical Instability Issue (mentioning)

confidence: 99%


End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

Zhang, Boeddeker, Watanabe, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in the reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel multi-speaker reverberant condition, and propose to extend our previous framework for end-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend subnetworks including voice-activity-detection-like masks. The techniques significantly stabilize the end-to-end training process. The experiments on the spatialized wsj1-2mix corpus show that the proposed system achieves about 35% WER relative reduction compared to our conventional multi-channel E2E ASR system, and also obtains decent speech dereverberation and separation performance (SDR = 12.5 dB) in the reverberant multi-speaker condition while trained only with the ASR criterion.


“…Many useful techniques have been proposed to estimate masks, e.g., by neural networks (NNs) [3,4] and clustering microphone array signals [5,6]. The mask-based BF approach effectively optimizes BFs and Convolutional BFs (CBFs) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS) [7,8]. A drawback of this approach, however, is that ATFs and BFs are estimated based on different criteria, and thus the estimated ATFs are not guaranteed to be optimal for BF/CBF estimation.…”

Section: Introduction (mentioning)

confidence: 99%
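
As a rough illustration of the mask-based BF approach mentioned in the statement above, the following NumPy sketch estimates spatial covariance matrices from time-frequency masks and derives a beamformer from them. It assumes one common formulation (an MVDR beamformer with a reference channel) and is not the CBF of the citing paper; the function names, array shapes, and the `ref` parameter are all hypothetical choices made for this example.

```python
import numpy as np

def masked_scm(Y, mask):
    """Mask-weighted spatial covariance matrix per frequency bin.

    Y:    complex STFT, shape (F, T, M) = (frequency, time, microphone)
    mask: real-valued mask in [0, 1], shape (F, T)
    Returns an array of shape (F, M, M).
    """
    # Sum over time of mask[f, t] * y[f, t] y[f, t]^H.
    num = np.einsum("ft,ftm,ftn->fmn", mask, Y, Y.conj())
    # Normalize by the mask mass, guarding against an all-zero mask.
    return num / np.maximum(mask.sum(axis=1), 1e-8)[:, None, None]

def mvdr_weights(Phi_s, Phi_n, ref=0):
    """MVDR weights w_f = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s).

    u is the one-hot vector selecting the reference microphone `ref`.
    Assumes Phi_n is invertible and the trace is not close to zero.
    """
    numer = np.linalg.solve(Phi_n, Phi_s)                 # (F, M, M)
    trace = np.trace(numer, axis1=-2, axis2=-1)[:, None]  # (F, 1)
    return numer[..., ref] / trace                        # (F, M)

def apply_beamformer(w, Y):
    """Beamformer output x_hat[f, t] = w_f^H y[f, t], shape (F, T)."""
    return np.einsum("fm,ftm->ft", w.conj(), Y)
```

In this sketch the masks determine the covariance estimates and the beamformer is then computed from those estimates in a separate step, which is the kind of two-stage pipeline the quoted statement is describing.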

Blind and Neural Network-Guided Convolutional Beamformer for Joint Denoising, Dereverberation, and Source Separation

Nakatani, Ikeshita, Kinoshita, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


This paper proposes an approach for optimizing a Convolutional BeamFormer (CBF) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS). First, we develop a blind CBF optimization algorithm that requires no prior information on the sources or the room acoustics, by extending a conventional joint DR and SS method. For making the optimization computationally tractable, we incorporate two techniques into the approach: the Source-Wise Factorization (SW-Fact) of a CBF and the Independent Vector Extraction (IVE). To further improve the performance, we develop a method that integrates a neural network (NN) based source power spectra estimation with CBF optimization by an inverse-Gamma prior. Experiments using noisy reverberant mixtures reveal that our proposed method with both blind and NN-guided scenarios greatly outperforms the conventional state-of-the-art NN-supported mask-based CBF in terms of the improvement in automatic speech recognition and signal distortion reduction performance.


L-SpEx: Localized Target Speaker Extraction

Wang et al. 2022

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
