“…A series of studies has established convergence (Jacot et al., 2018; Li and Liang, 2018; Du et al., 2019; Allen-Zhu et al., 2019b; Zou et al., 2018) and generalization (Allen-Zhu et al., 2019a; Arora et al., 2019a,b; Cao and Gu, 2019) guarantees in the so-called "neural tangent kernel" (NTK) regime, where the parameters stay close to their initialization and the neural network function is approximately linear in its parameters. A recent line of work (Allen-Zhu and Li, 2019; Bai and Lee, 2019; Allen-Zhu and Li, 2020a,b,c; Li et al., 2020; Cao et al., 2022; Zou et al., 2021; Wen and Li, 2021) studies the learning dynamics of neural networks beyond the NTK regime. It is worth mentioning that our analysis of the MoE model is also beyond the NTK regime.…”
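The approximate linearity underlying the NTK regime can be sketched as a first-order Taylor expansion of the network output around its initialization (the notation below is illustrative, not taken from the source):

```latex
% Network f(x;\theta) linearized around the initialization \theta_0:
f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0),
% which is linear in \theta and induces the (empirical) neural tangent kernel
K(x, x') \;=\; \nabla_\theta f(x;\theta_0)^\top \, \nabla_\theta f(x';\theta_0).
```

When the parameters stay close to \(\theta_0\) throughout training, the gradients (and hence the kernel \(K\)) remain nearly constant, so training reduces to kernel regression with \(K\); analyses "beyond the NTK regime" drop this near-constancy assumption.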