Proceedings of the First Workshop on Insights From Negative Results in NLP 2020
DOI: 10.18653/v1/2020.insights-1.15
If You Build Your Own NER Scorer, Non-replicable Results Will Come

Abstract: We attempt to replicate a named entity recognition (NER) model implemented in a popular toolkit and discover that a critical barrier to doing so is the inconsistent evaluation of improper label sequences. We define these sequences and examine how two scorers differ in their handling of them, finding that one approach produces F1 scores approximately 0.5 points higher on the CoNLL 2003 English development and test sets. We propose best practices to increase the replicability of NER evaluations by increasing tra…
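To make the notion of an improper label sequence concrete, the following is an illustrative sketch, not code from the paper or from either scorer; the function name `extract_spans` and the `stray_i` parameter are our own. It contrasts two common conventions for extracting entity spans when an I- tag appears without a valid antecedent:

```python
def extract_spans(tags, stray_i="repair"):
    """Extract (type, start, end_exclusive) spans from a BIO tag sequence.

    stray_i="repair":  treat a stray I- tag as if it were B- (conlleval-style)
    stray_i="discard": drop tokens whose I- tag has no valid antecedent
    """
    spans = []
    start = None
    etype = None

    def close(end):
        nonlocal start, etype
        if start is not None:
            spans.append((etype, start, end))
        start = etype = None

    for i, tag in enumerate(tags):
        if tag == "O":
            close(i)
            continue
        prefix, _, label = tag.partition("-")
        # A valid continuation: I- tag matching the type of an open span.
        if prefix == "I" and start is not None and etype == label:
            continue
        close(i)
        if prefix == "B" or stray_i == "repair":
            start, etype = i, label
        # Otherwise the stray I- token is discarded.
    close(len(tags))
    return spans
```

On the prediction `["O", "I-PER", "I-PER", "O", "B-LOC"]`, the repair convention yields a PER span covering tokens 1-2 plus the LOC span, while the discard convention yields only the LOC span, so the two conventions produce different precision, recall, and F1 for the same model output.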

Cited by 9 publications (10 citation statements)
References 6 publications
“…While it is capable of identifying invalid transitions and supporting one's own implementation to constrain or repair invalid sequences, it does not provide common methods for repairing invalid sequences. Lignos and Kamyab (2020) demonstrate the difference that can occur when two scorers handle invalid label sequences differently. However, they do not provide any software to evaluate these differences and only test using CoNLL-03 English data with older neural models.…”
Section: Handling Invalid Label Transitions (mentioning; confidence: 91%)
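A minimal check for such invalid transitions in BIO tagging might look like the following sketch. This is our own illustrative helper, not part of any toolkit mentioned here:

```python
def invalid_transitions(tags):
    """Return indices of I- tags that do not validly continue an entity:
    an I- following O, or an I- whose type differs from the previous tag's."""
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            if prev == "O" or prev[2:] != tag[2:]:
                bad.append(i)
        prev = tag
    return bad

print(invalid_transitions(["B-PER", "I-PER", "I-LOC", "O", "I-ORG"]))  # [2, 4]
```

Identifying these positions is only the first step; as the citing work notes, a scorer must then also commit to a documented policy for constraining or repairing them.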
“…While we have described the repair methods that we are aware of, others may exist, whether intentionally or as accidental deviations from more common repair methods. Lignos and Kamyab (2020) demonstrate the variation that occurs due to different repair methods for invalid label transitions, finding that at least one NER toolkit takes an alternate approach to handling invalid transitions that consistently produces higher F1 scores for some models than scoring with conlleval. Its approach is not incorrect; these "edge cases" can be interpreted in different ways.…”
Section: Repairs in Practice (mentioning; confidence: 93%)
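The score-level effect of choosing a repair convention can be seen in a toy example. This is our own illustration with hand-constructed span sets, not either scorer's code:

```python
def span_f1(gold, pred):
    """Span-level F1 over sets of (type, start, end) tuples."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("PER", 1, 3)}
# The prediction tags "O I-PER I-PER O" contain a stray I-PER. Under a
# conlleval-style repair the stray tag starts the span (PER, 1, 3); under
# a discard convention the prediction yields no span at all.
repaired = {("PER", 1, 3)}
discarded = set()

print(span_f1(gold, repaired))   # 1.0
print(span_f1(gold, discarded))  # 0.0
```

Neither score is "the" correct one; they reflect two defensible readings of the same improper output, which is exactly why the repair method must be reported alongside the score.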
“…This spread allows for better differentiation, even though there is a higher standard deviation for each score. For example, BERT-CRF generally performs very similarly to BERT, but scores 5.87 points lower for TCM-UNSEEN, possibly due to how the CRF handles lower-confidence predictions differently (Lignos and Kamyab, 2020). Flair has the highest all-mentions recall and the highest recall for TCMs, suggesting that when type-confusable mentions have been seen in the training data, it is able to effectively disambiguate types based on context.…”
Section: TMR for English (mentioning; confidence: 99%)
“…As an example of a common departure from these practices, many papers that perform NER experiments publish the scores produced by NCRF++ (Yang et al, 2018). As previously detailed by Lignos and Kamyab (2020), NCRF++ uses an internal scorer with an undocumented label sequence repair method, so reporting any numbers from it would be contrary to guidelines 2 and 3. As Lignos and Kamyab demonstrated, on a specific subset of models that produce a high number of invalid transitions, that scorer produces F1 scores approximately half a point higher than the most commonly used external scorer.…”
Section: Guidelines for Reproducibility (mentioning; confidence: 99%)