2005
DOI: 10.1162/089120105774321109
Evaluating Discourse and Dialogue Coding Schemes

Abstract: Agreement statistics play an important role in the evaluation of coding schemes for discourse and dialogue. Unfortunately, there is a lack of understanding regarding appropriate agreement measures and how their results should be interpreted. In this article we describe the role of agreement measures and argue that only chance-corrected measures that assume a common distribution of labels for all coders are suitable for measuring agreement in reliability studies. We then provide recommendations for how reliability…
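To make the distinction concrete, below is a minimal Python sketch (not from the article; the coders and labels are invented) contrasting Cohen's kappa, whose chance term uses each coder's own label distribution, with Scott's pi, whose chance term uses a single label distribution shared by both coders, the kind of common-distribution measure the abstract argues for.

```python
# Minimal sketch: Cohen's kappa vs. Scott's pi for two coders.
# The label sequences below are invented for illustration only.
from collections import Counter

def observed_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    # Expected agreement from the product of each coder's separate marginals.
    pe = sum((pa[l] / n) * (pb[l] / n) for l in labels)
    po = observed_agreement(a, b)
    return (po - pe) / (1 - pe)

def scotts_pi(a, b):
    n = len(a)
    pooled = Counter(a) + Counter(b)
    # Expected agreement from one common label distribution shared by both coders.
    pe = sum((pooled[l] / (2 * n)) ** 2 for l in pooled)
    po = observed_agreement(a, b)
    return (po - pe) / (1 - pe)

coder1 = ["ack", "ack", "query", "ack", "statement", "ack"]
coder2 = ["ack", "query", "query", "ack", "ack", "ack"]
print(cohens_kappa(coder1, coder2), scotts_pi(coder1, coder2))
```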

Cited by 71 publications (47 citation statements)
References 5 publications
“…The correction for chance agreement in Cohen's kappa has been the subject of much controversy (Brennan and Prediger, 1981; Feinstein and Cicchetti, 1990; Uebersax, 1987; Byrt et al., 1993; Gwet, 2002; Di Eugenio and Glass, 2004; Sim and Wright, 2005; Craggs and Wood, 2005; Powers, 2012). Firstly, it assumes that when assessors are unsure of a score, they guess at random according to a fixed prior distribution of scores.…”
Section: Discussion
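To see why this assumption is contested, here is a small numeric illustration of my own (not taken from the cited works): with a heavily skewed label distribution, the assumed guessing model already predicts very high chance agreement, so even 96% raw agreement yields only a modest kappa.

```python
# Hypothetical 2x2 contingency table for two coders labelling 100 items.
both_pos, a_pos_b_neg, a_neg_b_pos, both_neg = 1, 2, 2, 95
n = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg

p_observed = (both_pos + both_neg) / n                        # 0.96
p_pos_a = (both_pos + a_pos_b_neg) / n                        # coder A's prior for "pos"
p_pos_b = (both_pos + a_neg_b_pos) / n                        # coder B's prior for "pos"
p_chance = p_pos_a * p_pos_b + (1 - p_pos_a) * (1 - p_pos_b)  # ~0.942 under the guessing model

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(p_observed, 2), round(p_chance, 3), round(kappa, 2))  # 0.96 0.942 0.31
```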
“…In corpus research there is much work with annotations that require more subjective judgements from an annotator about the behavior being annotated. This holds for Human Computer Interaction topics such as affective computing or the development of Embodied Conversational Agents with a personality, but also for work in computational linguistics on topics such as emotion (Craggs and McGee Wood, 2005), subjectivity (Wiebe et al., 1999; Wilson, 2008) and agreement and disagreement (Galley et al., 2004). If we want to interpret the results of classifiers in terms of the patterns of (dis)agreement found between annotators, we need to subject the classifiers, relative to each other and to the 'ground truth' data, to the same analyses used to evaluate and compare annotators.…”
Section: Related Work
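One way to read that suggestion is to treat the classifier as just another annotator and run the same pairwise agreement statistic over every pair. The sketch below uses invented labels and scikit-learn's cohen_kappa_score as the pairwise coefficient; any chance-corrected measure could be substituted.

```python
# Sketch: compare every pair of "annotators", including the classifier,
# with the same chance-corrected agreement statistic.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

annotations = {
    "annotator_1": ["agree", "disagree", "agree", "agree", "disagree"],
    "annotator_2": ["agree", "agree", "agree", "agree", "disagree"],
    "classifier": ["agree", "disagree", "agree", "disagree", "disagree"],
}

for (name_a, labels_a), (name_b, labels_b) in combinations(annotations.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")
```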
“…More data will yield more signal and the learner will ignore the noise. However, as Craggs and McGee Wood (2005) suggest, this also makes systematic disagreement dangerous, because it provides an unwanted pattern for the learner to detect. We demonstrate that machine learning can tolerate data with a low reliability measurement as long as the disagreement looks like random noise, and that when it does not, data can have a reliability measure commonly held to be acceptable but produce misleading results.…”
Section: Introduction
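The contrast can be made concrete with a toy simulation of my own (not the authors' experiment): corrupting about one in ten labels uniformly at random leaves the learner's decision rule essentially intact, whereas systematically relabelling one class of items whenever a second, irrelevant feature is large hands the learner a spurious but learnable pattern.

```python
# Toy simulation: random vs. systematic disagreement with a "gold" labelling.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 2))
true_y = (X[:, 0] > 0).astype(int)      # the real phenomenon depends on feature 0 only

# Random disagreement: ~10% of labels flipped uniformly at random.
y_random = np.where(rng.random(n) < 0.1, 1 - true_y, true_y)

# Systematic disagreement: class-1 items with a large feature 1 are always
# relabelled 0, so the disagreement itself forms a learnable pattern.
y_system = np.where((true_y == 1) & (X[:, 1] > 0.85), 0, true_y)

for name, y_train in [("random noise", y_random), ("systematic", y_system)]:
    model = LogisticRegression().fit(X, y_train)
    # A clearly nonzero weight on feature 1 means the learner has modelled the
    # annotation disagreement rather than the underlying phenomenon.
    print(name, "coefficients:", model.coef_.round(2),
          "accuracy vs. true labels:", round(model.score(X, true_y), 3))
```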