Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems 2020
DOI: 10.18653/v1/2020.eval4nlp-1.13
Are Some Words Worth More than Others?

Abstract: Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy only captures one aspect of a language model's behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language…
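
As context for the abstract's point, here is a minimal sketch of how perplexity is typically computed from per-token log-probabilities (plain Python; the function name and inputs are illustrative, not taken from the paper):

    import math

    def perplexity(reference_log_probs):
        # exp of the mean negative log-probability the model assigns
        # to each reference token; lower is better.
        n = len(reference_log_probs)
        return math.exp(-sum(reference_log_probs) / n)

Because written language is Zipfian, this average is dominated by a small set of very frequent tokens, which is the confound the abstract points to.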

Cited by 9 publications (6 citation statements). References 24 publications.
“…Along with perplexity, the value of accuracy is often stated. This metric expresses the share of correctly predicted tokens in the output sequence [30].…”
Section: B: Accuracy (mentioning; confidence: 99%)
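
The accuracy metric described in this statement can be sketched directly from the quoted definition (illustrative code only, not taken from reference [30]):

    def token_accuracy(predicted_ids, reference_ids):
        # Share of output positions where the predicted token
        # matches the reference token.
        assert len(predicted_ids) == len(reference_ids)
        correct = sum(p == r for p, r in zip(predicted_ids, reference_ids))
        return correct / len(reference_ids)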
“…Historical data is usually quite low-resourced, which provides an additional challenge to the detection of sparsely distributed Swadesh items. This requires using special metrics for imbalanced data (Dudy and Bedrick, 2020). The harmonic F1 score, traditionally used for such cases (Chinchor, 1992), still finds its application in the analysis of NLP tasks (Scherrer, 2021).…”
Section: Related Work (mentioning; confidence: 99%)
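
For reference, the harmonic F1 score mentioned in this statement is the harmonic mean of precision and recall; a minimal sketch computed from true-positive, false-positive, and false-negative counts (illustrative only):

    def f1_score(tp, fp, fn):
        # Harmonic mean of precision and recall; suited to imbalanced
        # data because it does not depend on true negatives.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)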
“…McCoy et al. (2021) introduced analyses to assess sequential and syntactic novelty in LMs. Focusing on the word frequency distribution, Dudy & Bedrick (2020) found that LMs under-perform when less frequent examples are encountered at test time. In the classification setting, various approaches have been proposed to help alleviate class imbalance in the data distribution, such as data augmentation (Sagawa et al., 2020) or the transfer of knowledge from high-frequency classes to infrequent ones (Ouyang et al., 2016; Zhu et al., 2014).…”
Section: Related Work (mentioning; confidence: 99%)
“…Meister & Cotterell (2021), for example, investigated the statistical tendencies of the distribution defined by neural LMs, whereas Kulikov et al. (2021) explored whether they adequately capture the modes of the distribution they attempt to model. At the same time, increased focus has been given to performance on rare or novel events in the data distribution, both for models of natural language (McCoy et al., 2021; Lent et al., 2021; Dudy & Bedrick, 2020; Oren et al., 2019) and neural models more generally (see, for example, Sagawa et al., 2020; D'souza et al., 2021; Blevins & Zettlemoyer, 2020; Czarnowska et al., 2019; Horn & Perona, 2017; Ouyang et al., 2016; Bengio, 2015; Zhu et al., 2014). Neither of these branches of work, however, has explored instance-level LM performance on rare sequences in the distribution.…”
Section: Introduction (mentioning; confidence: 99%)