2022
DOI: 10.48550/arxiv.2210.16859
Preprint

A Solvable Model of Neural Scaling Laws

Abstract: Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model - a joint generative data model and random feature model - th…
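
The power-law behavior described in the abstract is typically measured by fitting losses in log-log space. As a minimal illustrative sketch (the dataset sizes, loss values, and exponent below are synthetic, not taken from the paper):

```python
import numpy as np

# Hypothetical loss measurements at several dataset sizes D (in tokens).
# A neural scaling law posits L(D) = c * D**(-alpha) until another
# resource (e.g., parameter count) becomes the bottleneck.
D = np.array([1e6, 1e7, 1e8, 1e9])
c_true, alpha_true = 50.0, 0.25  # assumed values for illustration
L = c_true * D ** (-alpha_true)

# A power law is linear in log-log coordinates:
#   log L = log c - alpha * log D,
# so a least-squares line fit recovers the scaling exponent.
slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
alpha_hat, c_hat = -slope, np.exp(intercept)
print(alpha_hat)  # ≈ 0.25
```

On noise-free synthetic data the fit recovers the exponent exactly; on real loss curves one would fit only the power-law regime, before the other resource bottlenecks the loss.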

Cited by 6 publications (11 citation statements)
References 60 publications

“…(2021), Michaud et al (2023) or Debowski (2023), although we focus on practical models with finite capacity. The discrete nature of tokens contrasts with other recent works on scaling laws that have focused on continuous Gaussian inputs (e.g., Bahri et al, 2021;Maloney et al, 2022;Sorscher et al, 2022).…”
Section: Introduction
confidence: 89%
“…Scaling Exponents are Task-Dependent at Late Training Time, but not at Early Time. Prior works (Dyer & Gur-Ari, 2020; Atanasov et al, 2023; Roberts et al, 2022) predict early-time finite-width loss corrections that go as 1/width near the infinite-width limit in either lazy or feature-learning regimes. Bahri et al (2021) provide experiments demonstrating the 1/width convergence.…”
Section: Preprint
confidence: 98%
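
The predicted 1/width correction above can be checked empirically by regressing loss against 1/width. A minimal sketch, using synthetic losses rather than measured ones (the widths, L_inf, and c values are assumptions for illustration):

```python
import numpy as np

# Hypothetical early-time losses at several widths n, following the
# predicted leading finite-width correction near the infinite-width limit:
#   L(n) = L_inf + c / n
widths = np.array([64.0, 128.0, 256.0, 512.0, 1024.0])
L_inf, c = 2.0, 30.0  # assumed values for illustration
losses = L_inf + c / widths

# Regressing loss against 1/width recovers both the infinite-width loss
# (the intercept) and the coefficient of the 1/width correction (the slope).
slope, intercept = np.polyfit(1.0 / widths, losses, 1)
print(intercept, slope)  # ≈ (2.0, 30.0)
```

In practice one would plot loss against 1/width and check that the points fall on a line as width grows; curvature would signal higher-order finite-width effects.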
“…The model of Sharma and Kaplan (2022) was also generalized by Bahri et al (2021) to account for power law scaling in training data; they additionally relate scaling exponents to a power law spectrum of certain kernels. Maloney et al (2022) develop a random-feature model of scaling, in which power law scaling comes from power law spectra of the data feature-feature covariance matrix, and scaling exponents are determined by the power law exponent of these spectra. Hutter (2021) proposes a toy model of data scaling in which features are learned based on whether they've been seen during training, and a Zipfian distribution over features produces power law data scaling.…”
Section: Data Scaling (Single-epoch)
confidence: 99%
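
The Hutter-style toy model quoted above is simple enough to compute directly. In this illustrative reconstruction (the exponent and feature count are assumptions, not values from any cited paper), features follow a Zipfian law, a feature counts as "learned" iff it appeared in training, and each unlearned feature contributes its probability mass to the test loss:

```python
import numpy as np

# Zipfian feature distribution: p_i proportional to i**(-(1 + alpha)).
alpha = 0.5          # assumed tail exponent, for illustration
i = np.arange(1, 200_000)
p = i ** (-(1.0 + alpha))
p /= p.sum()

def expected_loss(T):
    # A feature i is unseen after T i.i.d. training samples with
    # probability (1 - p_i)**T, and an unseen feature contributes p_i
    # to the loss, so:
    #   E[L(T)] = sum_i p_i * (1 - p_i)**T
    return float(np.sum(p * (1.0 - p) ** T))

for T in (10**2, 10**3, 10**4):
    print(T, expected_loss(T))
```

The expected loss decays as a power law in the number of training samples T, with an exponent set by the Zipf tail, which is exactly the mechanism the citation statement describes.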