In this paper we present a speech-recognizer-based maximum-likelihood beamforming technique that can be used both for signal enhancement and speaker separation. The presented technique uses an HMM-based speech recognizer as a statistical model for the target signal to be enhanced or separated. The parameters of a filter-and-sum array processor are estimated to maximize the likelihood of the output as measured using the speech recognizer. The filter-and-sum operation may be performed either in the time domain or the frequency domain. When used for speaker separation, the beamforming must be performed individually for each of the speakers. Since the competing signal is also in-domain speech in this case, the statistical model used for the beamforming is now a factorial HMM formed from the HMM for the target and that for the competing speaker(s).

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.
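As a rough illustration of the time-domain filter-and-sum operation referred to in the abstract, the sketch below filters each microphone channel with its own FIR filter and sums the results, i.e. y[n] = sum_i sum_k h_i[k] x_i[n - k]. It only shows how a given set of filters is applied; the likelihood-maximizing estimation of those filters with the speech recognizer is not shown. The function name `filter_and_sum`, the array layout, and the use of NumPy are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def filter_and_sum(channels, filters):
    """Apply one FIR filter per microphone channel and sum the outputs.

    channels: array of shape (num_mics, num_samples), one row per microphone.
    filters:  array of shape (num_mics, num_taps), one FIR filter per microphone.
    Both the name and the layout are hypothetical choices for this sketch.
    """
    channels = np.asarray(channels, dtype=float)
    filters = np.asarray(filters, dtype=float)
    num_samples = channels.shape[1]
    output = np.zeros(num_samples)
    for x_i, h_i in zip(channels, filters):
        # Causal FIR filtering: full convolution truncated to the input length.
        output += np.convolve(x_i, h_i)[:num_samples]
    return output


# Two microphones observing the same signal; the second filter adds a
# one-sample delay before the channels are averaged.
signals = np.array([[1.0, 2.0, 3.0],
                    [1.0, 2.0, 3.0]])
taps = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
enhanced = filter_and_sum(signals, taps)
```

In the setting the abstract describes, the entries of `taps` would be the free parameters, adjusted so that the recognizer assigns the highest likelihood to the beamformer output.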