Beyond the Quadratic Approximation: The Multiscale Structure of Neural Network Loss Landscapes

Ma, Chao; Kunin, Daniel; Wu, Lei; Ying, Lexing

doi:10.4208/jml.220404

Cited by 3 publications

(7 citation statements)

References 11 publications

“…• We propose an Implicit Regularization Enhancement (IRE) framework to speed up the convergence towards flatter minima. As suggested by works like Blanc et al (2020), and Ma et al (2022), the implicit sharpness reduction often occurs at a very slow pace, along flat directions. Inspired by this picture, IRE particularly accelerates the dynamics along flat directions, while keeping sharp directions' dynamics unchanged.…”

Section: Introduction (mentioning)

confidence: 77%

“…Wu et al (2018; and Ma and Ying (2021) provided an explanation of implicit sharpness regularization from a dynamical stability perspective. Moreover, in-depth analysis of SGD dynamics near global minima shows that the SGD noise (Blanc et al, 2020; Ma et al, 2022; Damian et al, 2021) and the edge of stability (EoS)-driven (Wu et al, 2018; Cohen et al, 2021) oscillations (Even et al, 2024) can drive SGD/GD towards flatter minima. Additional studies explored how training components, including learning rate and batch size (Jastrzębski et al, 2017), normalization (Lyu et al, 2022), and cyclic LR (Wang and Wu, 2023), influence this sharpness regularization.…”

Section: Related Work (mentioning)

confidence: 99%

“…The most popular explanation for implicit regularization is that SGD and its variants tend to converge to flat minima (Keskar et al, 2016; Wu et al, 2017), and flat minima generalize better (Hochreiter and Schmidhuber, 1997; Jiang et al, 2019). However, the process of this implicit sharpness regularization occurs at a very slow pace, as demonstrated in works such as Blanc et al (2020), , and Ma et al (2022). Consequently, practitioners often use a large learning rate (LR) and extend the training time even when the loss no longer decreases, ensuring the convergence to flatter minima (He et al, 2016; Goyal et al, 2017; Hoffer et al, 2017).…”

Section: Introduction (mentioning)

confidence: 99%

Linguistic Semiotics

Wang¹

2020

Peking University Linguistics Research

Vision-language models (VLMs) that use contrastive language-image pre-training have shown promising zero-shot classification performance. However, they perform relatively poorly on imbalanced datasets, where the class distribution in the training data is skewed and minority classes are predicted badly. For instance, CLIP achieved only 5% accuracy on the iNaturalist18 dataset. We propose to add a lightweight decoder to VLMs to avoid the out-of-memory (OOM) problem caused by the large number of classes and to capture nuanced features for tail classes. Then, we explore improvements of VLMs using prompt tuning, fine-tuning, and incorporating imbalanced-learning algorithms such as Focal Loss, Balanced SoftMax and Distribution Alignment. Experiments demonstrate that the performance of VLMs can be further boosted when they are used with a decoder and imbalanced-learning methods. Specifically, our improved VLMs significantly outperform zero-shot classification by an average accuracy of 6.58%, 69.82%, and 6.17% on ImageNet-LT, iNaturalist18, and Places-LT, respectively. We further analyze the influence of pre-training data size, backbones, and training cost. Our study highlights the significance of imbalanced-learning algorithms in the face of VLMs pre-trained on huge data. We release our code at https://github.com/Imbalance-VLM/Imbalance-VLM.

Linguistic Semiotics

Wang¹

2020

Peking University Linguistics Research

“…Another similar idea, which focuses on the local properties of the loss landscape, is also a useful contribution. Another work (Ma et al 2022) extends the existing literature on the optimization of neural network loss functions by addressing the limitations of the quadratic approximation and emphasizing the importance of the multiscale structure. Their work contributes to the field by empirically demonstrating the subquadratic growth and separate-scales structure, offering explanations for intriguing training phenomena.…”

Section: Related Work (mentioning)

confidence: 99%

On the Unstable Convergence Regime of Gradient Descent

Chen, Peng, et al. 2024

AAAI

Traditional gradient descent (GD) has been thoroughly investigated for convex or L-smooth functions, and it is widely used in current neural network optimization. The classical descent lemma ensures that for an L-smooth function, the GD trajectory converges stably towards the minimum when the learning rate is below 2 / L. This convergence is marked by a consistent reduction in the loss function throughout the iterations. However, recent experimental studies have demonstrated that even when the L-smoothness condition is not met, or when the learning rate is increased so that the loss oscillates during the iterations, the GD trajectory still converges over the long run. This phenomenon is referred to as the unstable convergence regime of GD. In this paper, we present a theoretical perspective to offer a qualitative analysis of this phenomenon. Unstable convergence is in fact an inherent property of GD for general twice-differentiable functions. Specifically, the forward-invariance of GD is established, i.e., any point within a local region always remains within this region under GD iteration. Then, based on the forward-invariance, for an initialization outside an open set containing the local minimum, the loss function oscillates during the first several iterations and then becomes monotonically decreasing after the GD trajectory jumps into the open set. This work theoretically clarifies the unstable convergence phenomenon of GD discussed in previous experimental works. The unstable convergence of GD mainly depends on the choice of initialization, and it is actually inevitable due to the complex nature of the loss function.
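The 2 / L threshold mentioned in the abstract above can be illustrated with a minimal sketch, assuming a one-dimensional quadratic f(x) = 0.5 * L * x**2 (chosen here for illustration and not taken from the cited paper). The function name `gd_on_quadratic` and the values of `L`, `lr`, `x0`, and `steps` are hypothetical.

```python
def gd_on_quadratic(L, lr, x0=1.0, steps=50):
    """Run GD on f(x) = 0.5 * L * x**2 and return the loss at each step."""
    x = x0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * L * x * x)
        x = x - lr * L * x  # the gradient of f at x is L * x
    return losses


L = 4.0  # assumed smoothness constant, so the threshold 2 / L is 0.5
stable = gd_on_quadratic(L, lr=0.4)    # learning rate below 2 / L
unstable = gd_on_quadratic(L, lr=0.6)  # learning rate above 2 / L

# Below the threshold the loss shrinks at every step; above it the loss grows.
print(stable[-1] < stable[0], unstable[-1] > unstable[0])
```

On a pure quadratic the threshold is sharp: each step multiplies x by (1 - lr * L), so the iterates shrink only when lr < 2 / L. The unstable convergence regime described in the abstract concerns non-quadratic losses, where a learning rate above this local threshold need not lead to divergence.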

“…We note that this analysis only applies to stochastic gradient descent. In the case of full gradient descent there have been several recent works showing that the quadratic approximation model might be too simplistic (Ma et al, 2022; Damian et al, 2022; Cohen et al, 2021).…”

Section: Related Work (mentioning)

confidence: 99%

Training trajectories, mini-batch losses and the curious role of the learning rate

Sandler, Zhmoginov, Vladymyrov, et al. 2023

Preprint

Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However, its ability to converge to a global minimum remains shrouded in mystery. In this paper we propose to study the behavior of the loss function on fixed mini-batches along SGD trajectories. We show that the loss function on a fixed batch appears to be remarkably convex-like. In particular, for ResNet the loss for any fixed mini-batch can be accurately modeled by a quadratic function, and a very low loss value can be reached in just one step of gradient descent with a sufficiently large learning rate. We propose a simple model that allows us to analyze the relationship between the gradients of stochastic mini-batches and the full batch. Our analysis allows us to discover the equivalence between iterate aggregates and specific learning rate schedules. In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging we show that our proposed model matches the observed training trajectories on ImageNet. Our theoretical model predicts that an even simpler averaging technique, averaging just two points many steps apart, significantly improves accuracy compared to the baseline. We validated our findings on ImageNet and other datasets using the ResNet architecture.
