This paper presents a novel approach to language model adaptation for speech recognition. We define mutual information histograms, which account for different semantic and syntactic relations between words in text data, and introduce a novel word distance measure based on them. Using this measure, we created linguistically meaningful clusters of words obtained in first-pass speech recognition. The words in these clusters were used to adapt language models, and the adapted models were used for a second pass of speech recognition. We conducted experiments on the Fisher speech corpus of telephone conversations. Mutual information histograms for word pairs were estimated from the Fisher data as well as from data extracted from a corpus of New York Times articles. Results showed that the word clusters conveyed significant information and could help improve speech recognition accuracy.