Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT '05), 2005
DOI: 10.3115/1220575.1220579

On coreference resolution performance metrics

Abstract: The paper proposes a Constrained Entity-Alignment F-Measure (CEAF) for evaluating coreference resolution. The metric is computed by aligning reference and system entities (or coreference chains) with the constraint that a system (reference) entity is aligned with at most one reference (system) entity. We show that the best alignment is a maximum bipartite matching problem which can be solved by the Kuhn-Munkres algorithm. Comparative experiments are conducted to show that the widely-known MUC F-measure has serious…
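The alignment step described in the abstract can be made concrete with a short sketch. The following is a minimal illustration of mention-based CEAF (CEAF_m), not the reference scorer: the entity representation, the similarity choice φ(K, S) = |K ∩ S|, and the use of SciPy's linear_sum_assignment as a stand-in for the Kuhn-Munkres solver are all assumptions made for the example.

```python
# Minimal sketch of mention-based CEAF (CEAF_m). Entities are sets of mention
# identifiers; SciPy's Hungarian-algorithm solver stands in for Kuhn-Munkres.
# Illustrative only, not the reference scorer.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf_m(key_entities, sys_entities):
    """key_entities, sys_entities: lists of sets of mention ids."""
    # Similarity phi(K, S) = number of shared mentions (the CEAF_m choice).
    sim = np.array([[len(k & s) for s in sys_entities] for k in key_entities],
                   dtype=float)
    # Best one-to-one alignment = maximum bipartite matching; negate to
    # turn the maximization into the solver's minimization problem.
    rows, cols = linear_sum_assignment(-sim)
    best = sim[rows, cols].sum()
    recall = best / sum(len(k) for k in key_entities)
    precision = best / sum(len(s) for s in sys_entities)
    f1 = 2 * recall * precision / (recall + precision) if best else 0.0
    return recall, precision, f1

# Example: one reference entity split across two system entities.
key = [{"m1", "m2", "m3"}, {"m4"}]
sys = [{"m1", "m2"}, {"m3", "m4"}]
print(ceaf_m(key, sys))  # recall = precision = 0.75
```

Entity-based CEAF (CEAF_e) runs the same alignment with a normalized similarity such as 2|K ∩ S| / (|K| + |S|) in place of the raw mention overlap.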

Cited by 356 publications (352 citation statements); references 8 publications.
“…In the first, automatically detected mentions are provided to the models; in the second, the mentions are gold. The metrics used in our evaluations are MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), CEAF_e (Luo, 2005), CEAF_m (Luo, 2005), and BLANC (Recasens and Hovy, 2011). The scores have been calculated using the reference implementation of the CoNLL scorer (Pradhan et al., 2014).…”
Section: Results (citation type: mentioning; confidence: 99%)
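To contrast the link-based MUC measure cited above with CEAF's alignment view, here is a hedged sketch of MUC recall and precision (Vilain et al., 1995). Entities are again sets of mention identifiers; the function name and representation are illustrative, not taken from any of the cited scorers.

```python
# Sketch of the link-based MUC measure (Vilain et al., 1995), for contrast
# with CEAF. Entities are sets of mention ids; names are illustrative.
def muc(key_entities, sys_entities):
    def recall(keys, responses):
        num = den = 0
        for k in keys:
            # Partition of k induced by the responses; mentions not covered
            # by any response entity become singleton partitions.
            covered = set()
            parts = 0
            for r in responses:
                if k & r:
                    parts += 1
                    covered |= (k & r)
            parts += len(k - covered)   # singleton partitions
            num += len(k) - parts       # coreference links recovered
            den += len(k) - 1           # links needed to form the entity
        return num / den if den else 0.0

    r = recall(key_entities, sys_entities)
    p = recall(sys_entities, key_entities)  # precision = recall, roles swapped
    f = 2 * r * p / (r + p) if (r + p) else 0.0
    return r, p, f

# Example: one missing link between {a, b} and {c}.
print(muc([{"a", "b", "c"}], [{"a", "b"}, {"c"}]))  # (0.5, 1.0, ~0.667)
```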
“…But intrinsic human judgements are simply not consistent and reliable enough to provide an objective meta-evaluation tool. Moreover, all they provide is an insight into what humans (think they) like, not what is best or most useful for them (the two can be very different matters, as discussed in [4]). …”
Section: Evaluation Methods (citation type: mentioning; confidence: 99%)
“…There does not appear to be a single standard evaluation metric in the coreference resolution community. We opted to use the following three: MUC-6 [38], CEAF [23], and B-cubed [1], which seem to be the most widely accepted metrics. All three metrics compute Recall, Precision and F-scores on aligned gold-standard and resolver-tool coreference chains.…”
Section: Automatic Extrinsic Evaluation of Clarity (citation type: mentioning; confidence: 99%)
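For completeness, the mention-level B-cubed measure named in this excerpt can be sketched as follows, under the simplifying assumption that the key and the response contain the same mentions (the gold-mention setting); the helper names are illustrative.

```python
# Sketch of mention-level B-cubed (Bagga and Baldwin, 1998). Assumes every
# mention appears in both partitions; names are illustrative.
def b_cubed(key_entities, sys_entities):
    key_of = {m: k for k in key_entities for m in k}  # mention -> key entity
    sys_of = {m: s for s in sys_entities for m in s}  # mention -> sys entity
    mentions = list(key_of)
    recall = sum(len(key_of[m] & sys_of[m]) / len(key_of[m]) for m in mentions)
    precision = sum(len(key_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions)
    recall /= len(mentions)
    precision /= len(mentions)
    f1 = (2 * recall * precision / (recall + precision)
          if (recall + precision) else 0.0)
    return recall, precision, f1
```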
“…We re-use features that are commonly used for mention-pair classification (see, e.g., [23], [4]), including grammatical type and subtypes, string and substring matches, apposition and copula, distance (number of separating mentions/sentences/words), gender and number match, synonymy/hypernymy and animacy (based on WordNet), family name (based on closed lists), named-entity types, syntactic features, and anaphoricity detection. Evaluation metrics: the systems' outputs are evaluated using the three standard coreference resolution metrics: MUC [29], B³ [2], and entity-based CEAF (or CEAF_e) [20]. Following the convention used in CoNLL-2012, we report a global F1 score (henceforth, CoNLL score), which corresponds to an unweighted average of the MUC, B³ and CEAF_e F1 scores.…”
Section: Noun Phrase Coreference Resolution (citation type: mentioning; confidence: 99%)
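The CoNLL score mentioned in this excerpt is simply the unweighted mean of the MUC, B³ and CEAF_e F1 values; a trivial sketch, with placeholder numbers standing in for scorer output:

```python
# CoNLL score = unweighted average of three F1 values from a coreference
# scorer. The arguments below are placeholders, not real results.
def conll_score(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0

print(conll_score(0.60, 0.55, 0.50))  # -> 0.55
```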