We analyze multi-layer neural networks in the asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations. We rigorously establish the limiting behavior of the multi-layer neural network output. The limiting procedure is valid for any number of hidden layers, and it also describes the limiting behavior of the training loss. The key ideas are to (a) take the limit of each hidden layer sequentially and (b) characterize the evolution of the parameters in terms of their initialization. The limit satisfies a system of deterministic integro-differential equations. The proof uses methods from weak convergence theory and stochastic analysis. We show that, under suitable assumptions on the activation functions and on the large-time behavior, the limit neural network recovers a global minimum (with zero loss for the objective function).
Introduction

Machine learning, and in particular deep learning, has achieved immense practical success, revolutionizing fields such as image, text, and speech recognition. It is also increasingly being used in engineering, medicine, and finance. However, despite this practical success, there is currently limited mathematical understanding of deep neural networks. This has motivated recent mathematical research on multi-layer learning models such as [39], [40], [41], [20], [21], [42], [49], [50], [43], and [48].

Neural networks are nonlinear statistical models whose parameters are estimated from data using stochastic gradient descent (SGD) methods. Deep learning uses neural networks with many layers (i.e., "deep" neural networks), which produces a highly flexible, powerful, and effective model in practice. Typically, a neural network with multiple layers between the input and the output layer is called a "deep" neural network; see for example [24]. We analyze multi-layer neural networks that have a fixed number of layers between the input and output layer, and where the number of hidden units in each layer becomes large.

Applications of deep learning include image recognition (see [35] and [24]), facial recognition [59], driverless cars [6], speech recognition (see [35], [4], [36], and [60]), and text recognition (see [62] and [57]). Neural networks are also finding increasingly many applications in engineering, robotics, medicine, and finance (see [37], [38], [58], [26], [47], [3], [51], [52], [53], and [54]).

In this paper we characterize multi-layer neural networks in the asymptotic regime of large network sizes and large numbers of stochastic gradient descent iterations. We rigorously prove the limit of the neural network output as the number of hidden units increases to infinity. The proof relies upon weak convergence analysis for stochastic processes. The result can be considered a "law of large numbers" for the neural network's output when both the network size and the number of stochastic gradient descent steps grow to infinity. We show that the neural network output, in the limit of a large number of hidden units and a large number of SGD iterations, satisfies the system of deterministic integro-differential equations described above.
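To fix ideas, the following is a minimal sketch of the type of object under study, assuming a network with two hidden layers, a mean-field (normalized) scaling of the hidden units, and a quadratic loss; the notation here ($g^{N_1,N_2}$ for the scaled output, $\sigma$ for the activation function, $\alpha^{N_1,N_2}$ for the learning rate) is illustrative and not necessarily the exact convention adopted later in the paper:
$$
g^{N_1,N_2}_{\theta}(x) \;=\; \frac{1}{N_2}\sum_{i=1}^{N_2} c^{i}\,
  \sigma\!\Big( \tfrac{1}{N_1}\textstyle\sum_{j=1}^{N_1} w^{2,i,j}\,\sigma\big( w^{1,j}\cdot x \big) \Big),
\qquad
\theta = \big(c^{i},\, w^{2,i,j},\, w^{1,j}\big)_{i,j},
$$
and, for a data sample $(x_k, y_k)$ drawn at step $k$, one stochastic gradient descent update takes the form
$$
\theta_{k+1} \;=\; \theta_k \;-\; \alpha^{N_1,N_2}\,
  \nabla_{\theta}\,\tfrac{1}{2}\big( y_k - g^{N_1,N_2}_{\theta_k}(x_k) \big)^2 .
$$
In this scaling, the object characterized by the deterministic integro-differential equations is the limit of $g^{N_1,N_2}_{\theta_k}$ as $N_1, N_2 \to \infty$ and the number of SGD iterations $k$ grows.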