Proceedings of the First Workshop on Insights From Negative Results in NLP 2020
DOI: 10.18653/v1/2020.insights-1.15
If You Build Your Own NER Scorer, Non-replicable Results Will Come

Abstract: We attempt to replicate a named entity recognition (NER) model implemented in a popular toolkit and discover that a critical barrier to doing so is the inconsistent evaluation of improper label sequences. We define these sequences and examine how two scorers differ in their handling of them, finding that one approach produces F1 scores approximately 0.5 points higher on the CoNLL 2003 English development and test sets. We propose best practices to increase the replicability of NER evaluations by increasing tra…
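To make the notion of an improper label sequence concrete, the following is an illustrative sketch, not code from the paper or from either scorer; the function name `extract_spans` and the `stray_i` parameter are our own. It contrasts two common conventions for extracting entity spans when an I- tag appears without a valid antecedent:

```python
def extract_spans(tags, stray_i="repair"):
    """Extract (type, start, end_exclusive) spans from a BIO tag sequence.

    stray_i="repair":  treat a stray I- tag as if it were B- (conlleval-style)
    stray_i="discard": drop tokens whose I- tag has no valid antecedent
    """
    spans = []
    start = None
    etype = None

    def close(end):
        nonlocal start, etype
        if start is not None:
            spans.append((etype, start, end))
        start = etype = None

    for i, tag in enumerate(tags):
        if tag == "O":
            close(i)
            continue
        prefix, _, label = tag.partition("-")
        # A valid continuation: I- tag matching the type of an open span.
        if prefix == "I" and start is not None and etype == label:
            continue
        close(i)
        if prefix == "B" or stray_i == "repair":
            start, etype = i, label
        # Otherwise the stray I- token is discarded.
    close(len(tags))
    return spans
```

On the prediction `["O", "I-PER", "I-PER", "O", "B-LOC"]`, the repair convention yields a PER span covering tokens 1-2 plus the LOC span, while the discard convention yields only the LOC span, so the two conventions produce different precision, recall, and F1 for the same model output.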

Cited by 9 publications (10 citation statements)
References 6 publications
“…While it is capable of identifying invalid transitions and supporting one's own implementation to constrain or repair invalid sequences, it does not provide common methods for repairing invalid sequences. Lignos and Kamyab (2020) demonstrate the difference that can occur when two scorers handle invalid label sequences differently. However, they do not provide any software to evaluate these differences and only test using CoNLL-03 English data with older neural models.…”
Section: Handling Invalid Label Transitions (mentioning; confidence: 91%)
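A minimal check for such invalid transitions in BIO tagging might look like the following sketch. This is our own illustrative helper, not part of any toolkit mentioned here:

```python
def invalid_transitions(tags):
    """Return indices of I- tags that do not validly continue an entity:
    an I- following O, or an I- whose type differs from the previous tag's."""
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            if prev == "O" or prev[2:] != tag[2:]:
                bad.append(i)
        prev = tag
    return bad

print(invalid_transitions(["B-PER", "I-PER", "I-LOC", "O", "I-ORG"]))  # [2, 4]
```

Identifying these positions is only the first step; as the citing work notes, a scorer must then also commit to a documented policy for constraining or repairing them.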
“…While we have described the repair methods that we are aware of, others may exist, whether intentionally or as accidental deviations from more common repair methods. Lignos and Kamyab (2020) demonstrate the variation that occurs due to different repair methods for invalid label transitions, finding that at least one NER toolkit takes an alternate approach to handling invalid transitions that consistently produces higher F1 scores for some models than scoring with conlleval. Its approach is not incorrect; these "edge cases" can be interpreted in different ways.…”
Section: Repairs in Practice (mentioning; confidence: 93%)
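The score-level effect of choosing a repair convention can be seen in a toy example. This is our own illustration with hand-constructed span sets, not either scorer's code:

```python
def span_f1(gold, pred):
    """Span-level F1 over sets of (type, start, end) tuples."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("PER", 1, 3)}
# The prediction tags "O I-PER I-PER O" contain a stray I-PER. Under a
# conlleval-style repair the stray tag starts the span (PER, 1, 3); under
# a discard convention the prediction yields no span at all.
repaired = {("PER", 1, 3)}
discarded = set()

print(span_f1(gold, repaired))   # 1.0
print(span_f1(gold, discarded))  # 0.0
```

Neither score is "the" correct one; they reflect two defensible readings of the same improper output, which is exactly why the repair method must be reported alongside the score.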
“…This spread allows for better differentiation, even though there is a higher standard deviation for each score. For example, BERT-CRF generally performs very similarly to BERT, but scores 5.87 points lower for TCM-UNSEEN, possibly due to how the CRF handles lower-confidence predictions differently (Lignos and Kamyab, 2020). Flair has the highest all-mentions recall and the highest recall for TCMs, suggesting that when type-confusable mentions have been seen in the training data, it is able to effectively disambiguate types based on context.…”
Section: TMR for English (mentioning; confidence: 99%)
“…As an example of a common departure from these practices, many papers that perform NER experiments publish the scores produced by NCRF++ (Yang et al, 2018). As previously detailed by Lignos and Kamyab (2020), NCRF++ uses an internal scorer with an undocumented label sequence repair method, so reporting any numbers from it would be contrary to guidelines 2 and 3. As Lignos and Kamyab demonstrated, on a specific subset of models that produce a high number of invalid transitions, that scorer produces F1 scores approximately half a point higher than the most commonly used external scorer.…”
Section: Guidelines for Reproducibility (mentioning; confidence: 99%)