2022 · Preprint
DOI: 10.48550/arxiv.2205.10343

Towards Understanding Grokking: An Effective Theory of Representation Learning

Abstract: We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set. We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of fou…
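
For concreteness, the toy setting the abstract refers to is typically an algorithmic dataset such as modular addition, where the model sees only a fraction of all input pairs and must generalize to the rest. Below is a minimal sketch of such a setup; the modulus p = 97, the 30% training fraction, and the seed are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Minimal sketch of the modular-addition toy task studied in the
# grokking literature. Modulus and training fraction are illustrative
# assumptions, not the paper's exact configuration.
p = 97
train_frac = 0.3

# Every pair (a, b) with label (a + b) mod p.
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Random train/test split; grokking is typically reported when the
# training fraction is small, so the model can first memorize the split.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
n_train = int(train_frac * len(pairs))
X_train, y_train = pairs[idx[:n_train]], labels[idx[:n_train]]
X_test, y_test = pairs[idx[n_train:]], labels[idx[n_train:]]
print(f"{len(X_train)} training pairs, {len(X_test)} held-out pairs")
```

On such a split, training accuracy saturates quickly while test accuracy can lag far behind; this delayed generalization is the gap the paper's effective theory aims to explain.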

Cited by 6 publications (16 citation statements) · References 13 publications
Citation types: 1 supporting, 15 mentioning, 0 contrasting

Citation statements (ordered by relevance):
“…These include: (i) the precise role of regularization in deep nonlinear neural networks; (ii) feature learning; (iii) the role of training data distributions in the optimization dynamics and generalization performance of the network; (iv) data-, parameter- and compute-efficiency of training; (v) interpretability of learnt features; and (vi) expressivity of architectures and complexity of tasks. …(6)–(7). The Fourier image shows the same peak as found by GD, but also weak peaks corresponding to 2m = 6 mod 97, 2n = 6 mod 97 and m − n = 6 mod 97 that were suppressed by the choice of phases via (12).…”
Section: Introduction and Overview of Literature (supporting)
confidence: 54%
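
The “Fourier image” in this excerpt refers to inspecting learned quantities in frequency space: networks that generalize on modular arithmetic are reported to learn periodic features over the input index, which appear as sharp peaks in the magnitude of the discrete Fourier transform. Below is a minimal, self-contained sketch of that diagnostic, applied to synthetic cosine weights standing in for a trained first layer; the width, planted frequencies, and phases are illustrative assumptions, not the citing paper's actual model:

```python
import numpy as np

p = 97            # modulus from the excerpt's task
n_neurons = 128   # illustrative width, not the citing paper's model

# Stand-in for a trained first-layer weight matrix: each row is a
# cosine over the input index, the feature type grokked networks on
# modular addition are reported to learn.
rng = np.random.default_rng(0)
freqs = rng.integers(1, p // 2, size=n_neurons)
phases = rng.uniform(0.0, 2.0 * np.pi, size=n_neurons)
m = np.arange(p)
W = np.cos(2.0 * np.pi * freqs[:, None] * m[None, :] / p + phases[:, None])

# "Fourier image": DFT magnitude of each neuron's weights over the
# input index. Periodic features show up as one sharp peak per row.
F = np.abs(np.fft.rfft(W, axis=1))
dominant = F.argmax(axis=1)
print("planted frequencies:", freqs[:8])
print("recovered peak bins:", dominant[:8])  # should match the planted ones
```

On real trained weights, the same transform is applied to the learned matrix itself; the weak secondary peaks the excerpt mentions are ones that a particular choice of phases can suppress.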
“…In addition, Ref. [7] argued that grokking is due to the competition between an encoder and a decoder. While this is certainly true in their model, in the present case there is no learnable encoder, yet grokking is still present.…”
Section: Discussion (mentioning)
confidence: 99%