We analyze multi-layer neural networks in the asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations. We rigorously establish the limiting behavior of the multi-layer neural network output. The limiting procedure is valid for any number of hidden layers, and it also describes the limiting behavior of the training loss. The key ideas are to (a) take the limit of each hidden layer sequentially and (b) characterize the evolution of the parameters in terms of their initialization. The limit satisfies a system of deterministic integro-differential equations. The proof uses methods from weak convergence theory and stochastic analysis. We show that, under suitable assumptions on the activation functions and on the large-time behavior, the limit neural network recovers a global minimum (with zero loss for the objective function).
Introduction

Machine learning, and in particular deep learning, has achieved immense practical success, revolutionizing fields such as image, text, and speech recognition. It is also increasingly being used in engineering, medicine, and finance. However, despite this practical success, there is currently limited mathematical understanding of deep neural networks. This has motivated recent mathematical research on multi-layer learning models such as [39], [40], [41], [20], [21], [42], [49], [50], [43], and [48].

Neural networks are nonlinear statistical models whose parameters are estimated from data using stochastic gradient descent (SGD) methods. Deep learning uses neural networks with many layers (i.e., "deep" neural networks), which produces a highly flexible, powerful, and effective model in practice. Typically, a neural network with multiple layers between the input and the output layer is called a "deep" neural network; see for example [24]. We analyze multi-layer neural networks that have a fixed number of layers between the input and output layer, and where the number of hidden units in each layer becomes large.

Applications of deep learning include image recognition (see [35] and [24]), facial recognition [59], driverless cars [6], speech recognition (see [35], [4], [36], and [60]), and text recognition (see [62] and [57]). Neural networks are also finding increasingly many applications in engineering, robotics, medicine, and finance (see [37], [38], [58], [26], [47], [3], [51], [52], [53], and [54]).

In this paper we characterize multi-layer neural networks in the asymptotic regime of large network sizes and large numbers of stochastic gradient descent iterations. We rigorously prove the limit of the neural network output as the number of hidden units increases to infinity. The proof relies upon weak convergence analysis for stochastic processes. The result can be considered a "law of large numbers" for the neural network's output when both the network size and the number of stochastic gradient descent steps grow to infinity. We show that the neural network output, in the limit of a large number of hidden units and a large number of SGD iterations, satisfies the system of deterministic integro-differential equations described above.
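To fix ideas, the following is a minimal sketch of the type of object under study, assuming a network with two hidden layers, a mean-field (normalized) scaling of the hidden units, and a quadratic loss; the notation here ($g^{N_1,N_2}$ for the scaled output, $\sigma$ for the activation function, $\alpha^{N_1,N_2}$ for the learning rate) is illustrative and not necessarily the exact convention adopted later in the paper:
$$
g^{N_1,N_2}_{\theta}(x) \;=\; \frac{1}{N_2}\sum_{i=1}^{N_2} c^{i}\,
  \sigma\!\Big( \tfrac{1}{N_1}\textstyle\sum_{j=1}^{N_1} w^{2,i,j}\,\sigma\big( w^{1,j}\cdot x \big) \Big),
\qquad
\theta = \big(c^{i},\, w^{2,i,j},\, w^{1,j}\big)_{i,j},
$$
and, for a data sample $(x_k, y_k)$ drawn at step $k$, one stochastic gradient descent update takes the form
$$
\theta_{k+1} \;=\; \theta_k \;-\; \alpha^{N_1,N_2}\,
  \nabla_{\theta}\,\tfrac{1}{2}\big( y_k - g^{N_1,N_2}_{\theta_k}(x_k) \big)^2 .
$$
In this scaling, the object characterized by the deterministic integro-differential equations is the limit of $g^{N_1,N_2}_{\theta_k}$ as $N_1, N_2 \to \infty$ and the number of SGD iterations $k$ grows.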