2021
DOI: 10.48550/arxiv.2105.06020
Preprint

Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

Abstract: Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)? Some work suggests larger models have higher out-of-distribution robustness, while other work suggests they have lower accuracy on rare subgroups. To understand these differences, we investigate these models at the level of individual instances. However, one major challenge is that individual predictions are highly sensitive to noise in the randomness in training. We develop statistically rigorous m…

Cited by 4 publications (6 citation statements)
References 17 publications
“…1. 10 BERT base models pretrained with different random seeds but not finetuned for particular tasks, released by Zhong et al. [37]. 2.…”
Section: Models We Study (mentioning)
confidence: 99%
“…2. 10 BERT medium models that were initialized from pretrained models released by Zhong et al. [37], which we further finetuned on MNLI with 10 different finetuning seeds (100 models total). 3.…”
Section: Models We Study (mentioning)
confidence: 99%
“…Aggregation when instance-level information is available. As illustrated by Zhong et al. (2021) and Ruder (2021), a fine-grained understanding of model performance should include instance-level scores. While taking the mean is quite natural in the classification setting, this is not always the case, as recently pointed out by Peyrard et al. (2021) in the NLG setting.…”
Section: Work In Progress (mentioning)
confidence: 99%
“…Notably, BioBERT large (M5.1-3) performs worse than its base counterparts, especially in the Int class, which warrants further investigation as no interpretable patterns could be found. However, recent findings [34] suggest that fine-tuning noise increases with model size and that instance-level accuracy has momentum, leading to larger models having higher variance due to the fine-tuning seed.…”
Section: The Importance Of the Pre-trained Language Model Domain (mentioning)
confidence: 99%