Contextual knowledge has traditionally been used in multi-sentential textual understanding systems. In contrast, this paper describes a new approach to using contextual, dialog-based knowledge for speech recognition. To demonstrate this approach, we have built MINDS, a system that uses contextual knowledge to predictively generate expectations about the conceptual content likely to be expressed in a system user's next utterance. These expectations are then expanded to constrain the set of words that may be matched against an incoming speech signal. To prevent system rigidity and allow for diverse user behavior, the system creates layered predictions that range from very specific to very general. Each time new information becomes available from the ongoing dialog, MINDS generates a different set of layered predictions for processing the next utterance. The predictions embody constraints derived from the contextual, dialog-level knowledge sources, and each prediction is translated into a grammar usable by our speech recognizer, SPHINX. Since speech recognizers use grammars to dictate legal word sequences and to constrain the recognition process, the dynamically generated grammars reduce the number of word candidates the recognizer must consider. The results demonstrate that speech recognition accuracy is greatly enhanced through the use of predictions.
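The layering scheme described above can be sketched in a few lines of code. This is an illustrative toy, not the actual MINDS implementation: the function names, the tiny example vocabularies, and the subset-based acceptance test are all hypothetical stand-ins for the real dialog knowledge sources and the SPHINX grammar interface.

```python
# Hypothetical sketch of layered predictions constraining recognition.
# In MINDS, each layer would be compiled into a grammar for SPHINX;
# here a layer is simply modeled as a set of permissible words.

def make_layered_predictions(dialog_state):
    """Return predictions ordered from most specific to most general.

    The word sets below are invented examples; in the real system they
    would be derived from the ongoing dialog's contextual knowledge.
    """
    specific = {"frigate", "speed", "current"}           # tightest, dialog-derived
    general = specific | {"what", "is", "the", "of"}     # broader fallback
    unconstrained = general | {"show", "all", "ships"}   # approximates full lexicon
    return [specific, general, unconstrained]

def recognize(hypothesis_words, layers):
    """Try each layer in turn; accept at the first (most constrained)
    layer whose vocabulary covers the simulated acoustic hypothesis."""
    for vocab in layers:
        if set(hypothesis_words) <= vocab:
            return vocab
    return None
```

For example, the utterance "what is the speed" fails the most specific layer but is accepted by the middle layer, so recognition proceeds under that tighter vocabulary rather than the full lexicon.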
OVERVIEW

One of the primary problems in speech recognition research is effectively analyzing very large, complex search spaces. As search space size increases, recognition accuracy decreases. Previous research in speech recognition illustrates that knowledge can compensate for search by constraining the exponential growth of a search space and thus increasing recognition accuracy [12,4,8]. The most common approach to constraining a search space is to use a grammar. The grammars used for speech recognition dictate legal word sequences. Normally they are applied in a strict left-to-right fashion and embody syntactic and semantic constraints on individual sentences. These constraints are represented in some form of probabilistic or semantic network which does not change from utterance to utterance [2,8]. Today, state-of-the-art speech recognizers can achieve word accuracy rates in excess of 95% when using grammars of perplexity 30-60. As the number of word alternatives at each point in time increases (that is, as perplexity increases), the performance of these systems decreases rapidly. Given this level of performance, researchers have recently begun using speech in computer problem-solving applications. Using speech as an input medium for computer applications has yielded two important findings. First, the grammars necessary to ensure some minimal coverage of a user's language have perplexities an order of magnitude larger than those used in today's high-performing speech systems [18]. Second, the use of speech in problem-solving tasks permits knowledge sources beyond the sentence level to be used to compensate for the extra search entailed by the inc...
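Since perplexity is the central measure in the discussion above, a small worked example may help. This uses the standard information-theoretic definition (perplexity = 2 to the per-word entropy), which is not specific to this paper; the function and its inputs are illustrative.

```python
import math

def perplexity(word_probs):
    """Compute perplexity from the per-word probabilities a language
    model assigns along a test sequence: 2 ** (average -log2 p).

    Intuitively, a perplexity of N means the recognizer faces roughly
    N equally likely word choices at each point in the utterance.
    """
    h = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** h

# A grammar that leaves a uniform choice among 32 words at every
# position has perplexity exactly 32, inside the 30-60 range where
# state-of-the-art recognizers of the time performed well:
# perplexity([1/32] * 10)  ->  32.0
```

An unconstrained grammar over a task vocabulary of several hundred words would push this number an order of magnitude higher, which is exactly the regime in which recognition accuracy degrades rapidly.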