A Modern Take on the Bias-Variance Tradeoff in Neural Networks

Neal, Brady; Mittal, Sarthak; Baratin, Aristide; Tantia, Vinayak; Scicluna, Matthew; Lacoste-Julien, Simon; Mitliagkas, Ioannis

doi:10.48550/arxiv.1810.08591

Cited by 48 publications

(64 citation statements)

References 0 publications

“…Each of these control parameters directly induces a different loss landscape by changing the data S train and/or architecture f θ for which the loss L(θ) is being computed. For example, we expect that increasing width will result in a smoother loss landscape [38]; we shall see this effect with CKA similarity in the transition from Phase IV-A to IV-B.…”

Section: Setup (mentioning)

confidence: 59%

Taxonomizing local versus global structure in neural network loss landscapes

Yang, Hodgkinson, Theisen, et al. 2021

Preprint

Viewing neural network models in terms of their loss landscapes has a long history in the statistical mechanics approach to learning, and in recent years it has received attention within machine learning proper. Among other things, local metrics (such as the smoothness of the loss landscape) have been shown to correlate with global properties of the model (such as good generalization performance). Here, we perform a detailed empirical analysis of the loss landscape structure of thousands of neural network models, systematically varying learning tasks, model architectures, and/or quantity/quality of data. By considering a range of metrics that attempt to capture different aspects of the loss landscape, we demonstrate that the best test accuracy is obtained when: the loss landscape is globally well-connected; ensembles of trained models are more similar to each other; and models converge to locally smooth regions. We also show that globally poorly-connected landscapes can arise when models are small or when they are trained to lower quality data; and that, if the loss landscape is globally poorly-connected, then training to zero loss can actually lead to worse test accuracy. Based on these results, we develop a simple one-dimensional model with load-like and temperature-like parameters, we introduce the notion of an effective loss landscape depending on these parameters, and we interpret our results in terms of a rugged convexity of the loss landscape. When viewed through this lens, our detailed empirical results shed light on phases of learning (and consequent double descent behavior), fundamental versus incidental determinants of good generalization, the role of load-like and temperature-like parameters in the learning process, different influences on the loss landscape from model and data, and the relationships between local and global metrics, all topics of recent interest.

“…From the perspective of bias/variance trade-off, Geman et al (1992), and more recently, Neal et al (2018) empirically observe that while bias is monotonically decreasing, variance could be decreasing too or unimodal as the number of parameters increases, thus manifesting a double descent generalization curve. Hastie et al (2019) analytically study the variance.…”

Section: Related Work and Discussion (mentioning)

confidence: 97%

Multi-scale Feature Learning Dynamics: Insights for Double Descent

Pezeshki, Mitra, Bengio, et al. 2021

Preprint

A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, resulting from the high-dimensional interactions between the large number of network parameters. Such non-trivial dynamics lead to intriguing behaviors such as the phenomenon of "double descent" of the generalization error. The more commonly studied aspect of this phenomenon corresponds to model-wise double descent where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent in which the test error undergoes two non-monotonous transitions, or descents, as the training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error. We validate our findings through numerical experiments where our theory accurately predicts empirical findings and remains consistent with observations in deep neural networks.

“…Experimentation is amplified by label noise. With the observation of unimodal variance (Neal et al, 2018), (Yang et al, 2020) decomposes the risk into bias and variance, and posits that double descent arises due to the bell-shaped variance curve rising faster than the bias decreases.…”

Section: Related Work (mentioning)

confidence: 99%

Mitigating Deep Double Descent by Concatenating Inputs

Chen, Wang, Kyrillidis 2021

Proceedings of the 30th ACM International Conference on Information & Knowledge Management

The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical bias-variance curve with the behavior of modern neural networks, occurring where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomenon and the number of samples in the deep neural network setting. In particular, we propose a construction which augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent, and observe a smooth descent into the overparameterized region for our construction. This occurs both with respect to the model size, and with respect to the number of epochs.
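
Several of the citation statements above refer to decomposing the risk into bias and variance. As a rough illustration only, the sketch below estimates both terms empirically by refitting a model on many resampled training sets. It is not the procedure of any of the cited papers: polynomial regression stands in for a neural network, and the target function, noise level, and the function name `bias_variance` are all assumptions made for this example. The error is measured against the noiseless target, so the irreducible noise term does not appear.

```python
import numpy as np

def bias_variance(degree, n_train=30, n_trials=200, noise=0.3, seed=0):
    """Estimate squared bias, variance, and mean squared error of a
    polynomial fit of the given degree (hypothetical toy setup)."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1.0, 1.0, 50)
    f_test = np.sin(np.pi * x_test)  # assumed noiseless target function

    # One row of test predictions per independently drawn training set.
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(-1.0, 1.0, n_train)
        y = np.sin(np.pi * x) + noise * rng.normal(size=n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_test)

    mean_pred = preds.mean(axis=0)
    # Squared bias: gap between the average prediction and the target.
    bias_sq = np.mean((mean_pred - f_test) ** 2)
    # Variance: spread of predictions across training sets.
    variance = np.mean(preds.var(axis=0))
    # Error against the noiseless target; equals bias_sq + variance.
    mse = np.mean((preds - f_test) ** 2)
    return bias_sq, variance, mse

if __name__ == "__main__":
    for degree in (1, 3, 5, 9):
        b, v, m = bias_variance(degree)
        print(f"degree={degree}: bias^2={b:.4f} variance={v:.4f} mse={m:.4f}")
```

Sweeping `degree` here plays the role that network width plays in the discussion above; whether variance rises, falls, or is unimodal depends on the model family and the data, so this toy sweep should not be read as reproducing the cited findings.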
