1 Improved Performance in Spoken Natural Language Dialog Systems
Since approximately the mid-1980s, technology has been adequate (if not ideal) for researchers to construct spoken natural language dialog systems (SNLDS) in order to test theories of natural language processing and to see what machines are capable of within current technological limits. Over time, a few systems have been constructed with sufficient detail and robustness to permit some evaluation. For the most part, these systems were greatly limited by the available speech recognition technology. Continuous speech systems required speaker-dependent training and restricted vocabularies, yet still produced so many misrecognitions that recognition tended to be the limiting factor in the success of the system. For example, testing in 1991 of the Circuit Fix-It Shop of (Smith, Hipp, and Biermann, 1995) required an experimenter to remain in the room in order to notify the user when misrecognitions occurred. Fortunately, speech recognition capabilities are improving, and systems are being constructed that allow individuals to walk up and use them after a brief orientation. One example is the TRAINS system of (Allen et al., 1995), which was demonstrated at the 1995 ACL conference, where people simply sat down and used the system after the demonstrator gave them a brief set of instructions. Another example is the current system under development at Duke University that serves as a tutor for liberal arts students learning the basics of Pascal programming. In this system, the machine itself explains how to use it. More thorough and challenging methods of evaluation are now feasible.
This paper proposes some measures for evaluation based on a retrospective look at measures used in the past, analyzing their relevance in today's environment. For the future, we expect measurements of speech recognition performance and basic utterance understanding to remain important, but there should also be more emphasis on measuring robustness and the utility of domain-independent knowledge about dialog. Furthermore, we should expect real-time response from evaluated systems, a sharp reduction in the amount of specialized training needed to use systems, and the use of longitudinal studies to see how user behavior evolves.
Fundamentals in Evaluation
Linguistic Coverage
A forward-looking view of evaluation is offered by (Whittaker and Stenton, 1989). It is forward-looking in the sense that they investigated issues in evaluation independently of building a system. Their perspective was based not on a specific SNLDS but on a general analysis of the issue of evaluation. Their main point was that evaluation needed to be placed within the context of a system's use. Consequently, they used a Wizard of Oz study in an information retrieval environment (e.g., database query) in order to identify the types of natural language inputs a typical user would use in order to gain access to needed information. Their analysis identified the following requireme...