2018
DOI: 10.1007/s40687-018-0148-y

Deep relaxation: partial differential equations for optimizing deep neural networks

Abstract: In this paper we establish a connection between non-convex optimization methods for training deep neural networks and nonlinear partial differential equations (PDEs). Relaxation techniques arising in statistical physics, which have already been used successfully in this context, are reinterpreted as solutions of a viscous Hamilton-Jacobi PDE. A stochastic control interpretation allows us to prove that the modified algorithm performs better in expectation than stochastic gradient descent. Well-known PDE regula…
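To make the relaxation concrete, the following is a minimal sketch, not the authors' code, of an Entropy-SGD-style update in which the gradient of a smoothed "local entropy" loss is estimated by an inner stochastic-gradient-Langevin loop; the paper interprets this smoothed loss as a solution of a viscous Hamilton-Jacobi PDE. All names and hyperparameters (entropy_sgd_step, gamma, inner_steps, eta, beta_inv) are illustrative assumptions.

import numpy as np

def entropy_sgd_step(x, grad_f, eta=0.1, gamma=0.03, inner_steps=20,
                     inner_eta=0.01, beta_inv=1e-4, rng=None):
    # One Entropy-SGD-style step (hedged sketch, not the authors' code).
    # The inner Langevin loop samples y around x from the Gibbs measure of
    # f(y) + ||y - x||^2 / (2 * gamma); the gradient of the smoothed
    # (local-entropy) loss is then approximately (x - mean(y)) / gamma.
    rng = np.random.default_rng() if rng is None else rng
    y = x.copy()
    mu = x.copy()                      # running mean of the inner iterates
    for k in range(1, inner_steps + 1):
        noise = np.sqrt(2.0 * inner_eta * beta_inv) * rng.standard_normal(x.shape)
        y = y - inner_eta * (grad_f(y) + (y - x) / gamma) + noise
        mu += (y - mu) / (k + 1)       # incremental average
    return x - eta * (x - mu) / gamma  # outer gradient step on the smoothed loss

# Tiny usage example on a non-convex one-dimensional objective
grad = lambda x: 3.0 * np.cos(3.0 * x) + 0.2 * x   # gradient of sin(3x) + 0.1*x^2
x = np.array([2.0])
for _ in range(100):
    x = entropy_sgd_step(x, grad)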

Cited by 95 publications (104 citation statements) · References 47 publications
Citing publications span 2018–2024.

Citation statements (ordered by relevance):
“…We set Y−1 = Y0 to denote the initial condition. Similar to the symplectic integration in (10), this scheme is reversible. We show that the second-order network is stable in the sense of (6) when we assume stationary weights.…”
Section: Hyperbolic CNNs (mentioning)
confidence: 99%
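The excerpt above refers to a leapfrog-type discretization of a second-order (hyperbolic) residual network. A plausible form of the scheme, reconstructed here as an assumption (the exact layer function and signs are those of the citing paper, not of this report), is

\[ Y_{j+1} = 2 Y_j - Y_{j-1} + h^2\, f(Y_j, \theta_j), \qquad Y_{-1} = Y_0, \]

and solving the same relation for \(Y_{j-1}\) gives the reverse pass

\[ Y_{j-1} = 2 Y_j - Y_{j+1} + h^2\, f(Y_j, \theta_j), \]

which is why the scheme is reversible.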
“…In [33], the authors proposed a Lipschitz regularization term to the optimization problem and showed (theoretically) that the output of the regularized network converges to the correct classifier when the data satisfies certain conditions. In addition, there are several recent works that have made connections between optimization in deep learning and numerical methods for partial differential equations, in particular, the entropy-based stochastic gradient descent [6] and a Hamilton-Jacobi relaxation [7]. For a review of some other recent mathematical approaches to DNN, see [45] and the citations within.…”
Section: Introduction (mentioning)
confidence: 99%
“…We will now focus on the third term in (4.6). Using Hölder's inequality together with Doob's L^p inequality, see for instance [14, Theorem 1, §3, p. 20], and A3, we get…”
Section: 1 (mentioning)
confidence: 99%
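For reference, the Doob L^p maximal inequality invoked in the excerpt above states that for a martingale \((M_s)_{s \ge 0}\) and \(p > 1\),

\[ \mathbb{E}\Big[\sup_{0 \le s \le t} |M_s|^p\Big] \;\le\; \Big(\tfrac{p}{p-1}\Big)^{p}\, \mathbb{E}\big[|M_t|^p\big], \]

which, combined with Hölder's inequality, bounds the stochastic-integral term referred to in the quote; the specific reference [14, Theorem 1, §3, p. 20] is the citing paper's.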
“…where Σ is a covariance matrix and dW_t is a standard m-dimensional Wiener process defined on a probability space. The idea of approximating stochastic gradient descent with a continuous time process has been noted by several authors, see [3,4,6,13,30,31]. A special case of what we prove in this paper, see Theorem 2.7 below, is that the stochastic gradient descent (1.7) used to minimize the risk for the ResNet model in (1.5) converges to the stochastic gradient descent used to minimize the risk for the Neural ODE model in (1.4).…”
Section: Introduction (mentioning)
confidence: 99%
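The continuous-time model referred to in the excerpt above is a diffusion whose drift is the negative risk gradient; a common form (notation assumed here, since the displayed equation is truncated in the excerpt) is dX_t = -∇f(X_t) dt + Σ^{1/2} dW_t. The sketch below simulates such a diffusion with the Euler-Maruyama scheme; the function name and step sizes are illustrative assumptions.

import numpy as np

def euler_maruyama(grad_f, sigma_sqrt, x0, dt=1e-3, n_steps=1000, rng=None):
    # Simulate dX_t = -grad_f(X_t) dt + sigma_sqrt dW_t with Euler-Maruyama.
    # sigma_sqrt is a (matrix) square root of the covariance Sigma; dW is an
    # m-dimensional Wiener increment, matching the notation assumed above.
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x - grad_f(x) * dt + sigma_sqrt @ dW
    return x

# Usage: quadratic risk f(x) = 0.5 * ||x||^2 with isotropic noise
X_T = euler_maruyama(lambda x: x, sigma_sqrt=0.1 * np.eye(2), x0=[2.0, -1.0])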