Interspeech 2020
DOI: 10.21437/interspeech.2020-2432

End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming

Abstract: Despite successful applications of end-to-end approaches in multi-channel speech recognition, the performance still degrades severely when the speech is corrupted by reverberation. In this paper, we integrate the dereverberation module into the end-to-end multi-channel speech recognition system and explore two different frontend architectures. First, a multisource mask-based weighted prediction error (WPE) module is incorporated in the frontend for dereverberation. Second, another novel frontend architecture i…

Cited by 29 publications (18 citation statements)
References 36 publications
“…The ASR backend is a joint connectionist temporal classification (CTC) / attention-based encoder-decoder [13] model for recognizing the separated single-channel speech. Compared to those in our previous work [22], the proposed architecture can support different beamformer variants in a single framework, by using a single mask estimator for WPE / beamforming and applying single-source WPE for processing speech of different sources.…”
Section: PIT-based Loss (mentioning)
confidence: 99%
“…The numerical problem generally originates from the complex operations in the WPE and beamforming formulas, such as the complex matrix inverse, leading to poor performance in certain frequency bins sparsely populated. Such behaviors are particularly undesirable in the joint training with ASR, as they can easily result in not-a-number (NaN) gradients that fail to backpropagate correctly and even prevent the model from converging properly [22], thus badly impacting the overall model performance. In order to mitigate this problem, we propose four approaches to improve the stability of both WPE and beamforming submodules:…”
Section: Attacking the Numerical Instability Issue (mentioning)
confidence: 99%
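
The instability described in this statement can be illustrated with a small sketch. It is not one of the four approaches the citing paper proposes (the excerpt does not list them). It shows diagonal loading, a common generic remedy, applied before inverting a rank-deficient spatial covariance matrix of the kind a sparsely populated frequency bin can produce. The function name `loaded_inverse` and the default `eps` are hypothetical choices made for illustration.

```python
import numpy as np

def loaded_inverse(cov, eps=1e-6):
    """Invert a Hermitian covariance matrix after diagonal loading.

    The loading term scales with the average diagonal power (trace / channels),
    so it adapts to the signal level. The name and the default eps are
    illustrative assumptions, not values taken from the cited papers.
    """
    channels = cov.shape[-1]
    scale = np.trace(cov).real / channels
    # Fall back to an absolute floor when the bin carries no energy at all.
    load = eps * scale if scale > 0 else eps
    return np.linalg.inv(cov + load * np.eye(channels, dtype=cov.dtype))

# A bin observed in a single frame gives a rank-1 covariance, which is singular.
x = np.array([1.0 + 1.0j, 0.5 - 0.2j, -0.3 + 0.7j])
cov = np.outer(x, x.conj())

rank = np.linalg.matrix_rank(cov)
inv = loaded_inverse(cov)
all_finite = bool(np.isfinite(inv).all())
```

Scaling the loading term by the trace keeps the regularization proportional to the signal power, so quiet and loud bins are treated alike. The trade-off is a small bias in the inverse.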
“…Many useful techniques have been proposed to estimate masks, e.g., by neural networks (NNs) [3,4] and clustering microphone array signals [5,6]. The mask-based BF approach effectively optimizes BFs and Convolutional BFs (CBFs) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS) [7,8]. A drawback of this approach, however, is that ATFs and BFs are estimated based on different criteria, and thus the estimated ATFs are not guaranteed to be optimal for BF/CBF estimation.…”
Section: Introduction (mentioning)
confidence: 99%
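
As a rough illustration of the mask-based beamforming this statement refers to, the sketch below estimates mask-weighted spatial covariance matrices for one frequency bin and derives beamformer weights from them. It assumes an MVDR beamformer with reference-channel selection, which is one common choice and not necessarily the formulation used in the citing paper. The function names, the `eps` values, and the toy steering vector `d` are hypothetical.

```python
import numpy as np

def masked_covariance(stft_bin, mask):
    """Mask-weighted spatial covariance for one frequency bin.

    stft_bin: complex array of shape (channels, frames)
    mask:     real array in [0, 1] of shape (frames,)
    """
    weighted = stft_bin * mask  # the mask broadcasts over channels
    return weighted @ stft_bin.conj().T / max(mask.sum(), 1e-10)

def mvdr_weights(cov_speech, cov_noise, ref_channel=0, eps=1e-8):
    """MVDR weights w = (Phi_n^-1 Phi_s) u / trace(Phi_n^-1 Phi_s)."""
    channels = cov_noise.shape[0]
    # A small diagonal term keeps the linear solve well conditioned.
    cov_noise = cov_noise + eps * np.eye(channels)
    numerator = np.linalg.solve(cov_noise, cov_speech)  # Phi_n^-1 Phi_s
    return numerator[:, ref_channel] / (np.trace(numerator) + eps)

# Toy check with exact covariances: rank-1 speech along d, spatially white noise.
d = np.array([1.0, 0.8 * np.exp(1j * 0.4), 0.6 * np.exp(-1j * 1.1)])
cov_s = np.outer(d, d.conj())
cov_n = np.eye(3, dtype=complex)

w = mvdr_weights(cov_s, cov_n)
response = w.conj() @ d  # beamformer gain toward the target direction
```

In this toy case the response toward `d` should be close to `d[0]`, which is 1, reflecting the distortionless constraint at the reference channel. In a mask-based system the two covariance matrices would instead come from `masked_covariance` applied with estimated speech and noise masks.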