Joint Training of Complex Ratio Mask Based Beamformer and Acoustic Model for Noise Robust ASR

Xu, Yong; Weng, Chao; Hui, Like; Liu, Jianming; Yu, Meng; Su, Dan; Yu, Dong

doi:10.1109/icassp.2019.8682576

Cited by 43 publications

(39 citation statements)

References 18 publications

“…Recently, several researches have shown that integrating the multi-channel information collected by a microphone array can improve the mask estimation of the reference channel and lead to better speech separation. It has been found in previous research that the complex ratio masks (CRMs) outperform both the binary masks (BMs) and real-value ratio masks (RMs) on speech separation [26], [43] and enhancement [44] tasks. For this reason, the CM based TF masking approach is implemented in this work.…”

Section: B. TF Masking (mentioning)

confidence: 98%
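To make the difference between a real-valued ratio mask and a complex ratio mask concrete, here is a minimal Python sketch of how each would be applied to a mixture spectrogram. The function names `complex_mask_bin` and `apply_complex_ratio_mask`, and the nested-list spectrogram layout, are assumptions made for illustration; they are not taken from the cited papers.

```python
def complex_mask_bin(mask_real, mask_imag, mix_real, mix_imag):
    """Apply a complex ratio mask to one time-frequency bin.

    This is an ordinary complex multiplication written out in real and
    imaginary parts, so the mask can change both magnitude and phase.
    Setting mask_imag to 0 reduces it to a real-valued ratio mask,
    which only rescales the magnitude and keeps the mixture phase.
    """
    est_real = mask_real * mix_real - mask_imag * mix_imag
    est_imag = mask_real * mix_imag + mask_imag * mix_real
    return est_real, est_imag


def apply_complex_ratio_mask(mask, mixture):
    """Apply a complex mask to a whole spectrogram.

    mask, mixture: lists of frames, each frame a list of Python complex
    numbers (hypothetical layout chosen to keep the sketch dependency-free).
    """
    return [
        [m * y for m, y in zip(mask_frame, mix_frame)]
        for mask_frame, mix_frame in zip(mask, mixture)
    ]


if __name__ == "__main__":
    # One frame with two frequency bins.
    mixture = [[1 + 1j, 2 - 3j]]
    mask = [[0.5 + 0.5j, 1 + 0j]]
    print(apply_complex_ratio_mask(mask, mixture))
```

In a real system the mask would be predicted by a neural network and the spectrogram would come from an STFT; both are left out here so that only the masking step is shown.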

“…In mask-based MVDR approaches, the deep neural networks are used to estimate the real-value [4], [5], [23] or complex [26] TF masks of the target speech $M_y(t, f)$ and other interfering sources $M_n(t, f)$ respectively. The PSD matrices corresponding to each source can be calculated with the estimated TF masks shown as follows:…”

Section: E. Mask-based MVDR (mentioning)

confidence: 99%
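The statement above is cut off before its equation, so the following is only a sketch of one commonly used formulation of mask-based MVDR, not necessarily the exact equations of the citing paper. It assumes mask-weighted, mask-normalised PSD estimates and the reference-channel MVDR solution $w(f) = \Phi_n^{-1}(f)\,\Phi_y(f)\,u / \mathrm{tr}(\Phi_n^{-1}(f)\,\Phi_y(f))$. The function names, the array layout `(channels, frames, freqs)`, and the `diag_load` and `eps` constants are assumptions for illustration.

```python
import numpy as np


def masked_psd(stft, mask, eps=1e-8):
    """Mask-weighted PSD estimate.

    stft: complex array, shape (channels, frames, freqs).
    mask: real array, shape (frames, freqs).
    Returns complex array, shape (freqs, channels, channels).
    """
    # Sum over frames of mask * y y^H for every frequency bin.
    psd = np.einsum("tf,ctf,dtf->fcd", mask, stft, stft.conj())
    # Normalise by the total mask weight per frequency.
    norm = mask.sum(axis=0)[:, None, None] + eps
    return psd / norm


def mvdr_weights(psd_speech, psd_noise, ref_channel=0, diag_load=1e-6, eps=1e-8):
    """Reference-channel MVDR filter for each frequency, shape (freqs, channels)."""
    n_ch = psd_noise.shape[-1]
    # Diagonal loading keeps the noise PSD invertible (assumed constant).
    loaded = psd_noise + diag_load * np.eye(n_ch)
    # Solve Phi_n X = Phi_y instead of forming the inverse explicitly.
    numerator = np.linalg.solve(loaded, psd_speech)
    trace = np.trace(numerator, axis1=-2, axis2=-1)[:, None] + eps
    # Selecting one column is the multiplication by the one-hot vector u.
    return numerator[..., ref_channel] / trace


def apply_beamformer(weights, stft):
    """Compute w^H y for every frame and frequency, shape (frames, freqs)."""
    return np.einsum("fc,ctf->tf", weights.conj(), stft)
```

A typical call sequence would estimate `masked_psd(stft, speech_mask)` and `masked_psd(stft, noise_mask)` from the same multi-channel STFT, pass both to `mvdr_weights`, and then call `apply_beamformer`.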

“…According to [25], [26], [40], [61], tight integration of the two components with joint fine-tuning can address above two [Fig. 3: Joint fine-tuning: $\nabla L_{\mathrm{REC}}$ and $\nabla L_{\mathrm{Si\text{-}SNR}}$ represent the gradients of the speech recognition (i.e. CTC, LF-MMI) and speech separation (Si-SNR) objective functions respectively; "LFB" denotes log filter bank acoustic features.]…”

Section: B. Integration of the Separation and Recognition Components (mentioning)

confidence: 99%

“…The mask-based MVDR [4]-[6], [20]-[23] and related mask-based GEV [24], [25] approaches predict the TF masks using DNNs before estimating the power spectral density (PSD) matrices for the target and overlapping speakers to obtain the beamforming filter parameters. Compared with the conventional stand-alone beamforming approaches, these neural based methods allow a tighter integration with the downstream recognition back-end [5], [6], [19], [25], [26]. Large performance improvements have been reported for overlapped speech recognition tasks by using microphone array based multi-channel inputs [5], [6].…”

Section: Introduction (mentioning)

confidence: 99%

Audio-Visual Multi-Channel Recognition of Overlapped Speech

Zhang et al. 2020

Interspeech 2020

Self Cite

Automatic speech recognition (ASR) technologies have been significantly advanced in the past few decades. However, recognition of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in current ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption and the additional cues they provide to separate the target speaker from the interfering sound sources, this paper presents an audio-visual multi-channel based recognition system for overlapped speech. It benefits from a tight integration between a speech separation front-end and recognition back-end, both of which incorporate additional video input. A series of audio-visual multi-channel speech separation front-end components based on TF masking, Filter&Sum and mask-based MVDR neural channel integration approaches are developed. To reduce the error cost mismatch between the separation and recognition components, the entire system is jointly fine-tuned using a multi-task criterion interpolation of the scale-invariant signal to noise ratio (Si-SNR) with either the connectionist temporal classification (CTC), or lattice-free maximum mutual information (LF-MMI) loss function. Experiments suggest that the proposed audio-visual multi-channel recognition system outperforms the baseline audio-only multi-channel ASR system by up to 8.04% (31.68% relative) and 22.86% (58.51% relative) absolute WER reduction on overlapped speech constructed using either simulation or replaying of the LRS2 dataset respectively. Consistent performance improvements are also obtained using the proposed audio-visual multi-channel recognition system when using occluded video input with the face region randomly covered up to 60%.
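The multi-task criterion mentioned in the abstract interpolates a separation objective (Si-SNR) with a recognition loss. The sketch below shows the standard definition of scale-invariant SNR and one plausible way of interpolating it with a recognition loss. The interpolation weight `alpha`, the function names, and the use of plain Python lists are assumptions for illustration; the abstract does not state the weighting actually used, and the recognition loss (CTC or LF-MMI) is represented only by a placeholder number.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for two equal-length signals."""
    n = len(target)
    # Remove the mean so the measure ignores DC offsets.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to get the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t) + eps
    scale = dot / target_energy
    s_target = [scale * b for b in t]
    # Whatever is left over counts as noise.
    e_noise = [a - p for a, p in zip(e, s_target)]
    signal_power = sum(p * p for p in s_target)
    noise_power = sum(q * q for q in e_noise) + eps
    return 10.0 * math.log10(signal_power / noise_power + eps)


def joint_loss(si_snr_db, recognition_loss, alpha=0.5):
    """Interpolate negative Si-SNR with a recognition loss (alpha is hypothetical)."""
    return alpha * (-si_snr_db) + (1.0 - alpha) * recognition_loss


if __name__ == "__main__":
    target = [1.0, -1.0, 2.0, -2.0]
    estimate = [1.1, -0.9, 2.1, -2.2]
    # 2.0 stands in for a CTC or LF-MMI loss value.
    print(joint_loss(si_snr(estimate, target), recognition_loss=2.0))
```

Because Si-SNR is a quality measure to be maximised, it enters the loss with a negative sign; in an actual system both terms would be differentiable tensors so that gradients flow back into the separation front-end.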

“…Speech enhancement is useful in many applications, such as speech recognition [1,2,3] and hearing aids [4,5]. Recently, the research community has witnessed a shift in methodology from conventional signal processing methods [6,7] to data-driven enhancement approaches, particularly those based on deep learning paradigms [8,9,3,10,11].…”

Section: Introduction (mentioning)

confidence: 99%

Self-Attention Generative Adversarial Network for Speech Enhancement

Phan, Nguyễn, Chén et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Existing generative adversarial networks (GANs) for speech enhancement solely rely on the convolution operation, which may obscure temporal dependencies across the sequence input. To remedy this issue, we propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of a speech enhancement GAN (SEGAN) using raw signal input. Further, we empirically study the effect of placing the self-attention layer at the (de)convolutional layers with varying layer indices as well as at all of them when memory allows. Our experiments show that introducing self-attention to SEGAN leads to consistent improvement across the objective evaluation metrics of enhancement performance. Furthermore, applying at different (de)convolutional layers does not significantly alter performance, suggesting that it can be conveniently applied at the highest-level (de)convolutional layer with the smallest memory overhead.
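As a rough illustration of the kind of layer the abstract describes, the following sketch applies non-local self-attention along the time axis of a one-dimensional feature map, as might come out of a (de)convolutional layer. It is not the authors' implementation: the projection matrices `w_q`, `w_k`, `w_v`, the residual scale `gamma`, and the `(length, channels)` layout are assumptions for illustration.

```python
import numpy as np


def self_attention_1d(x, w_q, w_k, w_v, gamma=0.0):
    """Non-local self-attention over the time axis of a feature map.

    x:   real array, shape (length, channels).
    w_q: query projection, shape (channels, d).
    w_k: key projection, shape (channels, d).
    w_v: value projection, shape (channels, channels).
    gamma: residual scale; with gamma=0 the layer is an identity mapping.
    Returns (output, attention) with shapes (length, channels), (length, length).
    """
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v
    # Every time step attends to every other time step.
    scores = q @ k.T
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    # Residual connection: attention output is added on top of the input.
    return x + gamma * (attn @ v), attn
```

The attention matrix has one entry per pair of time steps, so its memory cost grows quadratically with the feature-map length. This is consistent with the abstract's remark that the highest-level layer, where the feature map is shortest, gives the smallest memory overhead.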
