We propose the particle dual averaging (PDA) method, which generalizes the dual averaging method in convex optimization to optimization over probability distributions with a quantitative runtime guarantee. The algorithm consists of an inner loop and an outer loop: the inner loop utilizes the Langevin algorithm to approximately solve for a stationary distribution, which is then optimized in the outer loop. The method can thus be interpreted as an extension of the Langevin algorithm that naturally handles nonlinear functionals on the probability space. An important application of the proposed method is the optimization of neural networks in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but for which quantitative convergence rates can be challenging to obtain. By adapting finite-dimensional convex optimization theory to the space of measures, we analyze PDA for regularized empirical/expected risk minimization and establish quantitative global convergence in learning two-layer mean field neural networks under more general settings. Our theoretical results are supported by numerical simulations on neural networks of reasonable size.
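To make the two-loop structure concrete, here is a minimal numerical sketch of PDA for a two-layer mean-field network on squared loss. The tanh neuron, the regularization constants `l2reg` (weight decay) and `ent` (entropy weight), the step size, and the dual-averaging weights are illustrative assumptions, not the paper's exact prescription.

```python
import numpy as np

rng = np.random.default_rng(0)

def neuron(theta, X):
    """Single-neuron output h(theta, x) = a * tanh(<w, x>), with theta = (w, a)."""
    w, a = theta[:-1], theta[-1]
    return a * np.tanh(X @ w)

def model(particles, X):
    """Mean-field model: average of neuron outputs over the particles."""
    return np.mean([neuron(th, X) for th in particles], axis=0)

def pda(X, y, n_particles=100, outer_steps=50, inner_steps=200,
        l2reg=1e-2, ent=1e-2, eta=1e-2):
    n, d = X.shape
    particles = rng.normal(size=(n_particles, d + 1))
    avg_residual = np.zeros(n)            # running average defining the dual potential
    for t in range(1, outer_steps + 1):
        # Outer loop: dual-averaging update of the linearized loss (residual).
        residual = model(particles, X) - y
        avg_residual = (1 - 2 / (t + 1)) * avg_residual + (2 / (t + 1)) * residual

        # Inner loop: unadjusted Langevin steps targeting the Gibbs
        # distribution proportional to exp(-q_t(theta) / ent), where
        # q_t(theta) = <avg_residual, h(theta, .)> / n + l2reg * |theta|^2.
        for _ in range(inner_steps):
            grads = np.empty_like(particles)
            for r, th in enumerate(particles):
                w, a = th[:-1], th[-1]
                z = np.tanh(X @ w)
                gw = X.T @ (avg_residual * a * (1 - z ** 2)) / n + 2 * l2reg * w
                ga = (avg_residual @ z) / n + 2 * l2reg * a
                grads[r] = np.concatenate([gw, [ga]])
            noise = rng.normal(size=particles.shape)
            particles += -eta * grads / ent + np.sqrt(2 * eta) * noise
    return particles
```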
As an example of a nonlinear Fokker-Planck equation, the mean field Langevin dynamics has recently attracted attention due to its connection to (noisy) gradient descent on infinitely wide neural networks in the mean field regime; hence the convergence properties of the dynamics are of great theoretical interest. In this work, we give a simple and self-contained convergence rate analysis of the mean field Langevin dynamics with respect to the (regularized) objective function in both continuous- and discrete-time settings. The key ingredient of our proof is a proximal Gibbs distribution p_q associated with the dynamics, which, in combination with techniques from Vempala and Wibisono (2019), allows us to develop a convergence theory parallel to classical results in convex optimization. Furthermore, we reveal that p_q connects to the duality gap in the empirical risk minimization setting, which enables efficient empirical evaluation of the algorithm's convergence.
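As an illustration of the dynamics being analyzed, below is a minimal finite-particle, discrete-time simulation of one instance of the mean field Langevin dynamics: noisy gradient descent on a two-layer tanh network with weight decay, where the injected Gaussian noise corresponds to the entropic regularization. The model, loss, and hyperparameters are illustrative choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def mfld(X, y, n_particles=200, steps=2000, eta=5e-3, lam=1e-2, sigma=1e-1):
    """Noisy particle gradient descent approximating the mean field Langevin dynamics."""
    n, d = X.shape
    theta = rng.normal(size=(n_particles, d))        # one row per particle
    for _ in range(steps):
        act = np.tanh(X @ theta.T)                   # (n, n_particles)
        resid = act.mean(axis=1) - y                 # mean-field prediction residual
        # Gradient of the first variation of the regularized squared loss,
        # evaluated at each particle: (1/n) sum_i resid_i * tanh'(<x_i, theta>) x_i + lam * theta.
        grad = (X.T @ (resid[:, None] * (1 - act ** 2))).T / n + lam * theta
        # Langevin step: gradient descent plus injected Gaussian noise (entropy term).
        theta += -eta * grad + np.sqrt(2 * eta * sigma) * rng.normal(size=theta.shape)
    return theta
```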
We consider the linear model y = Xβ* + ε with X ∈ ℝ^{n×p} in the overparameterized regime p > n. We estimate β* via generalized (weighted) ridge regression: β̂_λ = (XᵀX + λΣ_w)† Xᵀy, where Σ_w is the weighting matrix. Assuming a random-effects model with general data covariance Σ_x and an anisotropic prior on the true coefficients β*, i.e., E[β*β*ᵀ] = Σ_β, we provide an exact characterization of the prediction risk E(y − xᵀβ̂_λ)² in the proportional asymptotic limit p/n → γ ∈ (1, ∞). Our general setup leads to a number of interesting findings. We outline precise conditions that decide the sign of the optimal setting λ_opt of the ridge parameter λ and confirm the implicit ℓ2 regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that λ_opt can be negative in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when X and β* are non-isotropic. Finally, we determine the optimal Σ_w for both the ridgeless (λ → 0) and optimally regularized (λ = λ_opt) cases, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.
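For concreteness, the snippet below instantiates the weighted ridge estimator β̂_λ = (XᵀX + λΣ_w)† Xᵀy and evaluates its prediction risk on a synthetic random-effects instance; the dimensions, the covariances Σ_x, Σ_β, Σ_w, and the noise level are illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, noise_var = 200, 400, 0.5, 0.01        # overparameterized: p > n

Sigma_x = np.diag(np.linspace(0.5, 2.0, p))       # data covariance
Sigma_b = np.diag(np.linspace(2.0, 0.5, p)) / p   # anisotropic prior on beta*
Sigma_w = np.eye(p)                               # weighting matrix (identity = standard ridge)

beta_star = rng.multivariate_normal(np.zeros(p), Sigma_b)
X = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n)
y = X @ beta_star + np.sqrt(noise_var) * rng.normal(size=n)

# Generalized (weighted) ridge estimator with pseudo-inverse.
beta_hat = np.linalg.pinv(X.T @ X + lam * Sigma_w) @ X.T @ y

# Prediction risk on a fresh test point x ~ N(0, Sigma_x):
# (beta_hat - beta_star)^T Sigma_x (beta_hat - beta_star) + noise variance.
diff = beta_hat - beta_star
risk = diff @ Sigma_x @ diff + noise_var
print(f"prediction risk at lambda={lam}: {risk:.4f}")
```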
We propose the particle dual averaging (PDA) method, which generalizes the dual averaging method in convex optimization to optimization over probability distributions with a quantitative runtime guarantee. The algorithm consists of an inner loop and an outer loop: the inner loop utilizes the Langevin algorithm to approximately solve for a stationary distribution, which is then optimized in the outer loop. The method can thus be interpreted as an extension of the Langevin algorithm that naturally handles nonlinear functionals on the probability space. An important application of the proposed method is the optimization of two-layer neural networks in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but for which quantitative convergence rates can be challenging to establish. We show that neural networks in the mean field limit can be globally optimized by PDA. Furthermore, we characterize the convergence rate by leveraging convex optimization theory in finite-dimensional spaces. Our theoretical results are supported by numerical simulations on neural networks of reasonable size.
We study the first gradient descent step on the first-layer parameters W in a two-layer neural network f(x) = (1/√N) aᵀσ(Wᵀx), where W ∈ ℝ^{d×N} and a ∈ ℝ^N are randomly initialized, and the training objective is the empirical MSE loss (1/n) Σᵢ (f(xᵢ) − yᵢ)². In the proportional asymptotic limit where n, d, N → ∞ at the same rate, and under an idealized student-teacher setting, we show that the first gradient update contains a rank-1 "spike", which results in an alignment between the first-layer weights and the linear component of the teacher model f*. To characterize the impact of this alignment, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on W with learning rate η, when f* is a single-index model. We consider two scalings of the first-step learning rate η. For small η, we establish a Gaussian equivalence property for the trained feature map, and prove that the learned kernel improves upon the initial random features model but cannot defeat the best linear model on the input. For sufficiently large η, in contrast, we prove that for certain f*, the same ridge estimator on trained features can go beyond this "linear regime" and outperform a wide range of random features and rotationally invariant kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.
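The following sketch mirrors the procedure described above: take one full-batch gradient step on the first-layer weights W, then fit ridge regression on the resulting conjugate-kernel features σ(Wᵀx). The single-index teacher, tanh activation, learning rate, and ridge penalty are illustrative choices, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, N, eta, ridge = 1000, 200, 300, 1.0, 1e-2

X = rng.normal(size=(n, d)) / np.sqrt(d)
W = rng.normal(size=(d, N)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=N) / np.sqrt(N)       # second layer, kept fixed
beta = rng.normal(size=d); beta /= np.linalg.norm(beta)
y = np.tanh(X @ beta)                                  # single-index teacher f*(x) = g(<x, beta>)

def features(X, W):
    """Conjugate-kernel feature map sigma(W^T x) with sigma = tanh."""
    return np.tanh(X @ W)

# One full-batch gradient step on W for the empirical MSE loss (1/n) sum (f(x_i) - y_i)^2.
Phi0 = features(X, W)
resid = Phi0 @ a - y
grad_W = 2.0 / n * X.T @ (resid[:, None] * (1 - Phi0 ** 2) * a)
W1 = W - eta * grad_W

# Ridge regression on the trained feature map.
Phi = features(X, W1)
coef = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(N), Phi.T @ y)

# Prediction risk on fresh test data.
X_test = rng.normal(size=(2000, d)) / np.sqrt(d)
y_test = np.tanh(X_test @ beta)
risk = np.mean((features(X_test, W1) @ coef - y_test) ** 2)
print(f"test risk after one gradient step + ridge: {risk:.4f}")
```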