2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953173
Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system

Cited by 111 publications (92 citation statements)
References 14 publications
“…generate a single-channel output, within which the filters can be either fixed or adaptive depending on the model design. The second category, which we refer to as masking-based (MB) beamforming, estimates the FaS beamforming filters in the frequency domain by estimating time-frequency (T-F) masks for the sources of interest [10][11][12][13][14][15][16][17][18][19][20][21][22][23]. The T-F masks specify the dominance of each T-F bin and are used to calculate the spatial covariance features required to obtain the optimal weights for beamformers such as the minimum variance distortionless response (MVDR) [24] and generalized eigenvalue (GEV) [25] beamformers.…”
Section: Introduction
confidence: 99%
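The excerpt above describes the mask-based pipeline: T-F masks weight the spatial covariance estimates, which in turn determine the beamformer weights. A minimal NumPy sketch of those two steps, assuming the masks are already estimated (e.g. by a neural network); the function names are illustrative, and the MVDR weights use one common reference-channel formulation:

```python
import numpy as np

def masked_covariance(Y, mask):
    """Mask-weighted spatial covariance per frequency bin.

    Y:    (F, T, C) multi-channel STFT
    mask: (F, T) T-F mask in [0, 1], marking the source's dominance
    Returns: (F, C, C) covariance matrices.
    """
    # Sum of mask-weighted outer products of the channel vectors
    num = np.einsum('ft,ftc,ftd->fcd', mask, Y, Y.conj())
    return num / np.maximum(mask.sum(axis=1), 1e-10)[:, None, None]

def mvdr_weights(Phi_s, Phi_n, ref=0):
    """MVDR weights, reference-channel formulation:
    w(f) = (Phi_n^{-1} Phi_s) e_ref / trace(Phi_n^{-1} Phi_s)
    """
    F, C, _ = Phi_s.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        X = np.linalg.solve(Phi_n[f], Phi_s[f])  # Phi_n^{-1} Phi_s
        w[f] = X[:, ref] / np.maximum(np.abs(np.trace(X)), 1e-10)
    return w
```

The beamformed output is then `np.einsum('fc,ftc->ft', w.conj(), Y)`; a GEV beamformer would instead take the principal generalized eigenvector of the pair (Phi_s, Phi_n) per bin.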
“…For the estimation of the RTF ṽ, we used a method based on eigenvalue decomposition with noise covariance whitening [21,22], and applied it to the output of WPE dereverberation, to reduce the effect of reverberation and noise on the estimation. For the estimation of the noise spatial covariance matrices, we assumed that each utterance had noise-only periods of 225 ms and 75 ms, respectively, at its beginning and end for REVERB, and we used noise masks estimated by a BLSTM network [23] for CHiME3. Table 1 summarizes the WERs of the observed signals (Obs) and the enhanced signals obtained after the first estimation iteration.…”
Section: Estimation Of Power Spectral Density And RTF
confidence: 99%
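The covariance-whitening RTF estimator mentioned in this excerpt can be sketched in a few lines: whiten the speech covariance with the Cholesky factor of the noise covariance, take the principal eigenvector, de-whiten, and normalize to a reference microphone. This is a minimal single-bin illustration under those assumptions; names are not from the cited works.

```python
import numpy as np

def rtf_covariance_whitening(Phi_s, Phi_n, ref=0):
    """RTF estimate by noise covariance whitening at one frequency bin.

    Phi_s, Phi_n: (C, C) speech / noise spatial covariance matrices
    Returns: (C,) RTF normalized so that its ref-th entry is 1.
    """
    L = np.linalg.cholesky(Phi_n)            # Phi_n = L L^H
    Linv = np.linalg.inv(L)
    Phi_w = Linv @ Phi_s @ Linv.conj().T     # whitened speech covariance
    _, vecs = np.linalg.eigh(Phi_w)          # Hermitian eigendecomposition
    u = vecs[:, -1]                          # principal eigenvector
    v = L @ u                                # de-whiten
    return v / v[ref]                        # normalize to reference mic
```

In the rank-1 signal model, the de-whitened principal eigenvector is proportional to the steering vector, so the normalization recovers the RTF relative to the reference channel.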
“…using (17), (18) and (21). Employing this in (15) we can express the convolutional beamformer coefficients as where we expressed Ḡ using (17) and (21), and q using (23).…”
Section: Appendix: Unified Versus Factorized Solution
confidence: 99%
“…When a microphone array is available, ASR performance can be greatly improved by employing multi-channel speech enhancement (SE) pre-processing with an ASR back-end trained on multi-condition training (MCT) data. For example, the combination of neural-network (NN) based time-frequency mask estimation with beamforming has been employed by all top systems in recent distant ASR challenges [3,4]. It is worth mentioning that multi-channel SE can improve ASR performance even without any retraining of the ASR back-end on the enhanced speech, which may be possible because it introduces only few distortions into the processed signals.…”
Section: Introduction
confidence: 99%