2020
DOI: 10.48550/arxiv.2001.04413
Preprint
Backward Feature Correction: How Deep Learning Performs Deep Learning

Abstract: How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this in terms of hierarchical learning. By hierarchical learning we mean that the learner represents a complicated target function by decomposing it into a sequence of simpler functions, reducing sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and auto…
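The decomposition the abstract describes can be illustrated with a toy sketch (the target, weights, and architecture below are hypothetical choices for illustration, not the paper's construction): a "complicated" target is a composition of two simple maps, and learning the feature map first and the readout second mirrors the layer-by-layer picture.

```python
import numpy as np

# Hypothetical two-stage target (not from the paper):
#   g1: x -> elementwise ReLU of a linear map  (simple feature map)
#   g2: h -> linear readout                    (simple readout)
# The composed f = g2 ∘ g1 is "complicated" relative to either stage.

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # illustrative first-layer weights
w2 = rng.normal(size=4)        # illustrative readout weights

def g1(x):
    return np.maximum(W1 @ x, 0.0)

def g2(h):
    return w2 @ h

def f(x):
    return g2(g1(x))   # the composed target, learnable stage by stage
```

Learning `g1` and `g2` separately is the sample- and time-complexity saving the abstract refers to: each stage is a far simpler hypothesis class than the composition learned end to end from scratch.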

Cited by 35 publications (124 citation statements)
References 27 publications
“…Additionally, the authors also demonstrate experimentally that depth 3 networks outperform depth 2 networks for learning ball indicators, even if the shallow network has many more tweakable weights. While we build on a similar proof strategy as in Safran and Shamir [25], our work differs from theirs in that they show that for a ball indicator function of any radius, there exists a distribution under which it is hard to approximate using depth 2 networks, whereas our work shows that for a specific distribution there exists a ball indicator function with radius in the interval [1,2] which is hard to approximate using a depth 2 network. Moreover, our main result rigorously proves the learnability of the target function used in their experiment.…”
Section: Related Work
confidence: 98%
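The ball-indicator targets in the quoted statement are easy to write down directly. The sketch below constructs the target function and a toy dataset; the dimension, radius, and sampling distribution are illustrative choices, not the cited paper's experimental setup:

```python
import numpy as np

def ball_indicator(x, radius=1.5):
    # 1 inside the Euclidean ball of the given radius, 0 outside.
    # A radius of 1.5 is an arbitrary value in the interval [1, 2]
    # discussed in the quoted statement.
    return (np.linalg.norm(x, axis=-1) <= radius).astype(float)

# Toy dataset in d dimensions (illustrative sampling distribution).
rng = np.random.default_rng(0)
d = 10
X = rng.normal(size=(1000, d))
y = ball_indicator(X)
```

The depth separation at issue is that a depth-3 network can realize this target via the two-stage composition x → ‖x‖² → threshold, whereas the quoted results show depth-2 networks need exponentially many units to approximate it under the constructed distribution.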
“…• We show that there exist a sequence $\{D_d\}_{d=2}^{\infty}$ of $d$-dimensional distributions and a sequence of constants $\{\lambda_d\}_{d=2}^{\infty}$, where $\lambda_d \in [1,2]$ for all $d$, such that no neural network of depth 2 and width less than $\Omega(\exp(\Omega(d)))$ can approximate ball indicator functions with radii $\{\lambda_d\}_{d=2}^{\infty}$ to accuracy better than $\Omega(d^{-4})$ (Thm. 2.2).…”
Section: Our Contributions
confidence: 99%