2022
DOI: 10.48550/arxiv.2203.05482
Preprint

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Abstract: The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter config…
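
The recipe the abstract describes, building a "model soup" by averaging the weights of several checkpoints fine-tuned from the same pre-trained initialization, can be sketched in a few lines. The following is a minimal illustration under stated assumptions (placeholder checkpoint paths and a placeholder build_model constructor), not the authors' reference implementation:

```python
import torch

def uniform_soup(state_dicts):
    """Element-wise average of state dicts fine-tuned from the same initialization.
    Assumes every entry is a floating-point tensor with matching shapes."""
    return {
        key: sum(sd[key] for sd in state_dicts) / len(state_dicts)
        for key in state_dicts[0]
    }

# Hypothetical checkpoints: each fine-tuned from the same pre-trained weights
# with a different hyperparameter configuration.
paths = ["finetune_lr1e-5.pt", "finetune_lr3e-5.pt", "finetune_lr1e-4.pt"]
state_dicts = [torch.load(p, map_location="cpu") for p in paths]

model = build_model()  # placeholder: constructs the same architecture as the checkpoints
model.load_state_dict(uniform_soup(state_dicts))
```

Because the soup is a single set of weights, inference costs the same as running any one of the fine-tuned models, which is the point made in the title.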

Cited by 29 publications (42 citation statements)
References 39 publications (64 reference statements)
“…In some domain shift scenarios [25,46,15,14], OOD generalization behavior may therefore form basins. In current work, however, barriers are not observed in vision during transfer [35,52]. Is that because all vision models achieve a basin with good structural generalization, or none do?…”
Section: Discussion and Future Work (mentioning)
confidence: 80%
“…The split between generalization strategies can potentially explain results from the bimodality of CoLA models [33] to wide variance on NLI diagnostic sets [31]. Because weight averaging can find parameter settings that fall on a barrier, we may even explain why weight averaging, which tends to perform well on vision tasks, fails in text classifiers [52]. Future work that distinguishes generalization strategy basins could improve the performance of such weight ensembling methods.…”
Section: Discussion and Future Work (mentioning)
confidence: 99%
“…Prior work (Wortsman et al., 2022) shows that averaging the weights of multiple models fine-tuned with different hyper-parameter configurations improves model performance. They analytically show the similarity in loss between weight-averaging ($L^{\text{AM}}_W$ in our setting) and logit-ensembling ($L^{\text{Ens}}_W$ in our setting) as a function of the flatness of the loss and confidence of the predictions.…”
Section: Connection To Bayesian Neural Network and Model Ensembling (mentioning)
confidence: 99%
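
A hedged way to make the quoted connection concrete (the notation below is ours, chosen to match the $L^{\text{AM}}_W$ / $L^{\text{Ens}}_W$ shorthand in the excerpt rather than taken from either paper): write $f(x;\theta)$ for the model's logits and $\bar\theta = \frac{1}{k}\sum_{i=1}^{k}\theta_i$ for the averaged weights. Taylor-expanding each fine-tuned solution around $\bar\theta$,

$$
\frac{1}{k}\sum_{i=1}^{k} f(x;\theta_i)
= f(x;\bar\theta)
+ \nabla_\theta f(x;\bar\theta)^\top \underbrace{\tfrac{1}{k}\sum_{i=1}^{k}\left(\theta_i-\bar\theta\right)}_{=\,0}
+ O\!\left(\max_i \lVert\theta_i-\bar\theta\rVert^2\right),
$$

so the logit ensemble and the weight-averaged model agree up to a second-order term controlled by the curvature (flatness) of the solution basin; how that output gap translates into a loss gap further depends on how confident the predictions are.
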
“…MPQA (Wiebe et al., 2005) and Subj (Pang & Lee, 2004) are used for polarity and subjectivity detection, where we follow Matena & Raffel (2021). Matena & Raffel (2021) propose to merge pre-trained language models which are fine-tuned on various text classification tasks. Wortsman et al. (2022) explore averaging model weights from various independent runs on the same task with different hyper-parameter configurations. Different from existing works, we focus on averaging weights of newly-added parameters for parameter-efficient fine-tuning purposes.…”
Section: Few-shot Performance (mentioning)
confidence: 99%
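
The excerpt above averages only the newly-added parameters from several parameter-efficient fine-tuning runs while the shared pre-trained backbone stays fixed. A minimal sketch of that idea, assuming the added parameters can be identified by a name prefix (the "adapter." prefix and checkpoint paths below are hypothetical):

```python
import torch

def average_added_params(state_dicts, prefix="adapter."):
    """Average only the newly-added parameters (identified by a name prefix)
    across runs; every other entry is taken from the first run, whose backbone
    is shared and frozen across runs."""
    merged = dict(state_dicts[0])
    for key in merged:
        if key.startswith(prefix):
            merged[key] = sum(sd[key] for sd in state_dicts) / len(state_dicts)
    return merged

# Hypothetical checkpoints from independent parameter-efficient fine-tuning runs
# of the same model on the same task.
runs = [torch.load(p, map_location="cpu") for p in ("run_a.pt", "run_b.pt", "run_c.pt")]
merged_state = average_added_params(runs)  # load into the model with load_state_dict
```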