The RADAR Test Methodology: Evaluating a Multi-Task Machine Learning System with Humans in the Loop

Steinfeld, Aaron; Bennett, Rachael; Cunningham, Kyle; Lahut, Matt; Quinones, Pablo-Alejandro; Wexler, Django; Siewiorek, Dan; Cohen, Paul R.; Fitzgerald, Julie C.; Hansson, Othar

doi:10.21236/ada457300

Cited by 6 publications

(3 citation statements)

References 11 publications

“…To investigate these questions, the RADAR team carried out extensive experimental evaluation. The details of the evaluation are reported by others in [18]; here, we give a summary.…”

Section: Discussion (mentioning)

confidence: 99%

The Radar Architecture for Personal Cognitive Assistance

Garlan, Schmerl (2007). Int. J. Soft. Eng. Knowl. Eng.

Abstract. Current desktop environments provide weak support for carrying out complex user-oriented tasks. Although individual applications are becoming increasingly sophisticated and feature-rich, users must map their high-level goals to the low-level operational vocabulary of applications, and deal with a myriad of routine tasks (such as keeping up with email, keeping calendars and web sites up-to-date, etc.). An alternative vision is that of a personal cognitive assistant. Like a good secretary, such an assistant would help users accomplish their high-level goals, coordinating the use of multiple applications, automatically handling routine tasks, and, most importantly, adapting to the individual needs of a user over time. In this paper we describe the architecture and its implementation for a personal cognitive assistant called RADAR. Key features include (a) extensibility through the use of a plug-in agent architecture, (b) transparent integration with legacy applications and data of today's desktop environments, and (c) extensive use of learning so that the environment adapts to the individual user over time.

“…The participants' primary email task is to read provided emails about an upcoming academic conference and consolidate all the changes that need to be made to the conference schedule and website [27]. They were given a spreadsheet with information about conference speakers, sessions, and talks, and asked to make changes to it based on change requests in the email, in 12 minutes.…”

Section: User Labels - Physical Activity Coach (mentioning)

confidence: 99%

“…They were given a spreadsheet with information about conference speakers, sessions, and talks, and asked to make changes to it based on change requests in the email, in 12 minutes. The emails and task were modified from the RADAR dataset [27]. The emails in the data set were labeled with a folder name, which was removed to test the participants.…”

Section: User Labels - Physical Activity Coach (mentioning)

confidence: 99%

Towards maximizing the accuracy of human-labeled sensor data

Rosenthal, Dey (2010). Proceedings of the 15th International Conference on Intelligent User Interfaces

We present two studies that evaluate the accuracy of human responses to an intelligent agent's data classification questions. Prior work has shown that agents can elicit accurate human responses, but the applications vary widely in the data features and prediction information they provide to the labelers when asking for help. In an initial analysis of this work, we found the five most popular features, namely uncertainty, amount and level of context, prediction of an answer, and request for user feedback. We propose that there is a set of these data features and prediction information that maximizes the accuracy of labeler responses. In our first study, we compare accuracy of users of an activity recognizer labeling their own data across the dimensions. In the second study, participants were asked to classify a stranger's emails into folders and strangers' work activities by interruptibility. We compared the accuracy of the responses to the users' self-reports across the same five dimensions. We found very similar combinations of information (for users and strangers) that led to very accurate responses as well as more feedback that the agents could use to refine their predictions. We use these results for insight into the information that helps labelers the most.
