Many mathematical convergence results for gradient descent (GD) based algorithms assume that the GD process is (almost surely) bounded; moreover, in concrete numerical simulations, divergence of the GD process may slow down, or even completely prevent, convergence of the error function. In practically relevant learning problems it therefore seems advisable to design ANN architectures in such a way that GD optimization processes remain bounded. The boundedness of GD processes for a given learning problem appears, however, to be closely related to the existence of minimizers in the optimization landscape; in particular, GD trajectories may escape to infinity if the infimum of the error function (objective function) is not attained in the optimization landscape. This naturally raises the question of the existence of minimizers in the optimization landscape. For shallow residual ANNs with multi-dimensional input layers and multi-dimensional hidden layers with ReLU activation, the main result of this work answers this question affirmatively for a general class of loss functions and all continuous target functions. In our proof of this statement, we propose a kind of closure of the search space, whose limit points we call generalized responses, and we then provide sufficient criteria on the loss function and the underlying probability distribution which ensure that all additional artificial generalized responses are suboptimal; this finally allows us to conclude that minimizers exist in the optimization landscape.
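As a toy illustration of the escape phenomenon mentioned above (a sketch unrelated to the paper's residual ANN setting), consider gradient descent on the one-dimensional objective f(x) = exp(-x), whose infimum 0 is not attained at any finite point, so that no minimizer exists:

```python
import math

# Toy illustration (not from the paper): gradient descent on f(x) = exp(-x).
# The infimum 0 is not attained at any finite x, so no minimizer exists,
# and the GD iterates escape to infinity while the loss tends to the infimum.

def f(x):
    return math.exp(-x)

def grad_f(x):
    return -math.exp(-x)

x = 0.0
step_size = 1.0
for n in range(1, 10_001):
    x -= step_size * grad_f(x)   # equivalently: x += exp(-x)

print(f"final iterate x   = {x:.2f}")    # grows roughly like log(n), i.e. without bound
print(f"final loss   f(x) = {f(x):.2e}") # approaches the unattained infimum 0
```

In this toy case the loss converges to its infimum, but only along an unbounded trajectory; the paper's result rules out this kind of behaviour by establishing that minimizers do exist in the landscape considered there.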
In this article we establish new central limit theorems for Ruppert-Polyak averaged stochastic gradient descent schemes. In contrast to previous work, we do not assume that convergence occurs to an isolated attractor, but instead allow convergence to a stable manifold. On the stable manifold the target function is constant, and the oscillations in the tangential direction may be significantly larger than those in the normal direction. As we show, one still recovers a central limit theorem with the same rates as in the case of isolated attractors. Here we typically consider step sizes $\gamma_n = n^{-\gamma}$ with $\gamma \in (3/4, 1)$.
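For readers unfamiliar with the scheme, the following minimal sketch (generic, not the authors' setting or code) shows Ruppert-Polyak averaging: run SGD with slowly decaying step sizes $n^{-\gamma}$ and report the running average of the iterates rather than the last iterate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic sketch of Ruppert-Polyak averaged SGD (illustrative only):
# minimize f(theta) = 0.5 * ||theta||^2 from noisy gradient observations.
def noisy_grad(theta):
    return theta + 0.1 * rng.standard_normal(theta.shape)

gamma = 0.8                      # step-size exponent, gamma in (3/4, 1)
theta = np.ones(2)               # SGD iterate
theta_bar = np.zeros(2)          # Ruppert-Polyak average of the iterates

for n in range(1, 100_001):
    step = n ** (-gamma)                  # gamma_n = n^(-gamma)
    theta -= step * noisy_grad(theta)
    theta_bar += (theta - theta_bar) / n  # running average of theta_1, ..., theta_n

print("last iterate:    ", theta)
print("averaged iterate:", theta_bar)
# The averaged iterate typically fluctuates much less than the last iterate;
# central limit theorems of the kind discussed above describe its fluctuations
# at the rate n^(-1/2).
```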
The realization function of a shallow ReLU network is a continuous and piecewise affine function $f : \mathbb{R}^d \to \mathbb{R}$, where the domain $\mathbb{R}^d$ is partitioned by a set of $n$ hyperplanes into cells on which $f$ is affine. We show that the minimal representation of $f$ uses either $n$, $n+1$, or $n+2$ neurons, and we characterize each of the three cases. In the particular case where the input layer is one-dimensional, minimal representations always use at most $n+1$ neurons, but in all higher-dimensional settings there are functions for which $n+2$ neurons are needed. We then show that the set of minimal networks representing $f$ forms a $C^\infty$-submanifold $M$, and we derive the dimension and the number of connected components of $M$. Additionally, we give a criterion for the hyperplanes that guarantees that all continuous, piecewise affine functions are realization functions of appropriate ReLU networks.
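For orientation, and in standard notation that may differ from the paper's, the realization function of a shallow ReLU network with $n$ hidden neurons can be written as
$$
f(x) \;=\; c \;+\; \sum_{i=1}^{n} v_i \,\max\{0,\ \langle w_i, x\rangle + b_i\}, \qquad x \in \mathbb{R}^d,
$$
with inner weights $w_i \in \mathbb{R}^d$, biases $b_i \in \mathbb{R}$, outer weights $v_i \in \mathbb{R}$, and output bias $c \in \mathbb{R}$. The hyperplanes referred to above are the kink sets $\{x \in \mathbb{R}^d : \langle w_i, x\rangle + b_i = 0\}$, across which $f$ may change its affine piece.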