In human-computer interaction, as in conversation, neither partner is omniscient. To facilitate repairs when problems arise, an interface needs to enable both user and system to coordinate their separate knowledge states. We present a conversational feedback model for human-computer interaction, based on a collaborative theory of human communication (Clark & Schaefer, 1989), and use this model to systematically provide context-sensitive feedback messages from an application-independent spoken language system. We then describe a simulation, an informal user study, and a working prototype that use this model in a telephone agent application that allows dialing by voice.
TOWARD A MORE ROBUST SPEECH INTERFACE

Traditional approaches to improving the performance of spoken language systems have focused on improving the accuracy of the underlying speech recognition and natural language processing technology. The assumption is that if a system can translate exactly what the user said into text and then map this text onto an application command, speech will be a successful input technique. With this approach, each gain in speech recognition accuracy requires asymptotically greater effort, which is ultimately reflected in the cost of the technology.

We argue that perfect performance by a speech recognizer is simply not possible, nor should it be the goal. There are limiting factors that are difficult or impossible to control, such as variability in the acoustic environment. Moreover, many words and phrases in English are homophones of other words and phrases, so in some situations both human and machine listeners find them ambiguous. People frequently have trouble discerning, remembering, or guessing the grammar and vocabulary that a system expects and then limiting themselves to it; this has been dubbed "the vocabulary problem" (Furnas, Landauer, Gomez, & Dumais, 1987). In addition, ambiguous input is a problem at the syntactic, semantic, and pragmatic levels. Finally, because people face many other demands while they are speaking, such as performing tasks, planning what to say next, and monitoring their listeners and the environment, they frequently do not produce the kind of fluent but constrained speech that a speech recognizer has been trained to process. Human utterances frequently contain speech errors, and yet this is rarely a problem for human listeners.

The intrinsic limits on the well-formedness of utterances and on the accuracy of speech recognition technology suggest that to solve the problem, we must first redefine it. Let us start by considering how people handle these problems in conversation.