In this work, we consider an acoustic beamforming application where two speakers are simultaneously active. We construct one subband-domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly optimize the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). Assuming that the subband snapshots are Gaussian-distributed, this MMI criterion reduces to the requirement that the cross-correlation coefficient of the subband outputs of the two GSCs vanishes. We also compare separation performance under the Gaussian assumption with that obtained from several super-Gaussian probability density functions (pdfs), namely, the Laplace, K0, and Γ pdfs. Our proposed technique provides effective nulling of the undesired source, but without the signal cancellation problems seen in conventional beamforming. Moreover, our technique does not suffer from the source permutation and scaling ambiguities encountered in conventional blind source separation algorithms. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge (SSC). On the SSC development data, the simple delay-and-sum beamformer achieves a word error rate (WER) of 70.4%. The MMI beamformer under a Gaussian assumption achieves a 55.2% WER, which is further reduced to 52.0% with a K0 pdf, whereas the WER for data recorded with a close-talking microphone is 21.6%.
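The key simplification in the abstract above is that, for zero-mean Gaussian outputs, the mutual information between the two beamformer outputs is a monotone function of their cross-correlation coefficient, so minimizing MI amounts to driving that coefficient to zero. The sketch below illustrates this for real-valued signals (the paper itself operates on complex subband snapshots); the function name is ours, not the paper's:

```python
import numpy as np

def gaussian_mutual_information(y1, y2):
    """Illustrative only: under a zero-mean Gaussian assumption, the mutual
    information between two real-valued outputs depends solely on their
    cross-correlation coefficient rho:  I(Y1; Y2) = -0.5 * log(1 - rho^2).
    Hence I is minimized exactly when rho vanishes."""
    rho = np.corrcoef(y1, y2)[0, 1]
    return -0.5 * np.log(1.0 - rho**2), rho
```

For example, two independent noise signals yield a correlation near zero and hence near-zero mutual information, while mixing one signal into the other raises both quantities together, which is why the MMI criterion reduces to a decorrelation requirement in the Gaussian case.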
In automatic speech recognition based on weighted finite-state transducers, a static decoding graph HC • L • G is typically constructed. In this work, we first show how the size of the decoding graph can be reduced, and the need to determinize it eliminated, by removing the ambiguity associated with transitions to the backoff state or states in G. We then show how the static construction can be avoided entirely by performing fast on-the-fly composition of HC and L • G. We demonstrate that speech recognition based on this on-the-fly composition requires approximately 80% more run-time than recognition based on the statically-expanded network R, which makes it competitive with other dynamic expansion algorithms that have appeared in the literature. Moreover, the dynamic algorithm requires approximately a factor of seven less main memory than recognition based on the static decoding graph.
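The memory saving comes from materializing states of the composed machine only as the search reaches them, rather than expanding the full network up front. The toy sketch below shows this lazy-composition idea for transducers stored as adjacency lists of (input, output, weight, next-state) arcs; the representation and function names are ours, and real implementations additionally need epsilon handling and composition filters, which are omitted here:

```python
# Minimal sketch of lazy (on-the-fly) transducer composition.
# Each machine maps state -> [(in_label, out_label, weight, next_state), ...].
# Arcs of the composed machine are computed and cached on demand, so only
# state pairs actually visited by the decoder are ever materialized.

def lazy_compose_arcs(arcs_a, arcs_b):
    cache = {}  # memoize arcs per composed state pair

    def arcs(pair):
        if pair in cache:
            return cache[pair]
        sa, sb = pair
        out = []
        for (ia, oa, wa, na) in arcs_a.get(sa, []):
            for (ib, ob, wb, nb) in arcs_b.get(sb, []):
                if oa == ib:  # output of A must match input of B
                    # weights combine in the tropical semiring (addition)
                    out.append((ia, ob, wa + wb, (na, nb)))
        cache[pair] = out
        return out

    return arcs
```

A decoder would call `arcs((start_a, start_b))` and follow returned state pairs during the search, so the composed graph exists only where the beam actually goes; this is the essential trade of extra per-arc run-time for a much smaller memory footprint.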