We study the necessary and sufficient complexity of ReLU neural networks - in terms of depth and number of weights - which is required for approximating classifier functions in an L^2-sense. As a model class, we consider the set E^β(R^d) of possibly discontinuous piecewise C^β functions f: [-1/2, 1/2]^d → R, where the different "smooth regions" of f are separated by C^β hypersurfaces. For given dimension d ≥ 2, regularity β > 0, and accuracy ε > 0, we construct artificial neural networks with ReLU activation function that approximate functions from E^β(R^d) up to an L^2 error of ε. The constructed networks have a fixed number of layers, depending only on d and β, and they have O(ε^{-2(d-1)/β}) many nonzero weights, which we prove to be optimal. For the proof of optimality, we establish a lower bound on the description complexity of the class E^β(R^d). By showing that a family of approximating neural networks gives rise to an encoder for E^β(R^d), we then prove that one cannot approximate a general function f ∈ E^β(R^d) using neural networks that are less complex than those produced by our construction. In addition to the optimality in terms of the number of weights, we show that in order to achieve this optimal approximation rate, one needs ReLU networks of a certain minimal depth. Precisely, for piecewise C^β(R^d) functions, this minimal depth is given - up to a multiplicative constant - by β/d. Up to a log factor, our constructed networks match this bound. This partly explains the benefits of depth for ReLU networks by showing that deep networks are necessary to achieve efficient approximation of (piecewise) smooth functions. Finally, we analyze approximation in high-dimensional spaces where the function f to be approximated can be factorized into a smooth dimension-reducing feature map τ and a classifier function g - defined on a low-dimensional feature space - as f = g ∘ τ. We show that in this case the approximation rate depends only on the dimension of the feature space and not on the input dimension.
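One reason L^2 (rather than uniform) approximation of discontinuous classifiers is feasible for ReLU networks is that a jump can be replaced by a steep ReLU ramp whose L^2 error decays with the ramp width. The NumPy sketch below illustrates this mechanism in one dimension for a toy function with a single jump; the target function, grid, and ramp widths are chosen only for illustration and this is not the construction from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ramp(x, t, delta):
    """ReLU surrogate for the indicator 1_{x >= t}: rises linearly from 0 to 1
    on [t, t + delta], realized with two ReLU units."""
    return (relu(x - t) - relu(x - t - delta)) / delta

# Discontinuous "classifier-like" target: smooth pieces separated by a jump at 0
f = lambda x: np.sin(2 * x) + 1.0 * (x >= 0)

x = np.linspace(-0.5, 0.5, 200001)
for delta in [1e-1, 1e-2, 1e-3]:
    # crude approximant: a piecewise-linear interpolant for the smooth part
    # plus one steep ramp for the jump (both expressible with ReLU units)
    approx = np.interp(x, x[::1000], np.sin(2 * x[::1000])) + ramp(x, 0.0, delta)
    err = np.sqrt(np.trapz((f(x) - approx) ** 2, x))   # L^2 error on [-1/2, 1/2]
    print(f"delta = {delta:.0e}   L2 error = {err:.4f}")
```

The printed errors scale like the square root of the ramp width, so the jump contributes arbitrarily little to the L^2 error even though the uniform error near the discontinuity stays of order one.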
We derive fundamental lower bounds on the connectivity and the memory requirements of deep neural networks guaranteeing uniform approximation rates for arbitrary function classes in L^2(R^d). In other words, we establish a connection between the complexity of a function class and the complexity of deep neural networks approximating functions from this class to within a prescribed accuracy. Additionally, we prove that our lower bounds are achievable for a broad family of function classes. Specifically, all function classes that are optimally approximated by a general class of representation systems - so-called affine systems - can be approximated by deep neural networks with minimal connectivity and memory requirements. Affine systems encompass a wealth of representation systems from applied harmonic analysis such as wavelets, ridgelets, curvelets, shearlets, α-shearlets, and more generally α-molecules. Our central result elucidates a remarkable universality property of neural networks and shows that they achieve the optimum approximation properties of all affine systems combined. As a specific example, we consider the class of α^{-1}-cartoon-like functions, which is approximated optimally by α-shearlets. We also explain how our results can be extended to the case of functions on low-dimensional immersed manifolds. Finally, we present numerical experiments demonstrating that the standard stochastic gradient descent algorithm generates deep neural networks providing close-to-optimal approximation rates. Moreover, these results indicate that stochastic gradient descent can actually learn approximations that are sparse in the representation systems optimally sparsifying the function class the network is trained on.

Throughout the paper, we consider the case Φ: R^d → R, i.e., N_L = 1, which includes situations such as the classification and temperature prediction problems described above. We emphasize, however, that the general results of Sections 3, 4, and 5 are readily generalized to N_L > 1. We denote the class of networks Φ: R^d → R with exactly L layers, connectivity no more than M, and activation function ρ by NN_{L,M,d,ρ}, with the understanding that for L = 1, the set NN_{L,M,d,ρ} is empty. Moreover, we let NN_{∞,M,d,ρ} := ⋃_{L∈N} NN_{L,M,d,ρ}, NN_{L,∞,d,ρ} := ⋃_{M∈N} NN_{L,M,d,ρ}, and NN_{∞,∞,d,ρ} := ⋃_{L∈N} NN_{L,∞,d,ρ}. Now, given a function f: R^d → R, we are interested in the theoretically best possible approximation of f by a network Φ ∈ NN_{∞,M,d,ρ}. Specifically, we will want to know how the approximation quality depends on the connectivity M and what the associated number of bits needed to store the network topology …

… Σ_{i=1}^{7} c_i f(· − d_i) is compactly supported, has 7 vanishing moments in the x_1-direction, and ĝ(ξ) = 0 for all ξ ∈ [−3, 3]^2 such that ξ_1 = 0. Then, by Theorem 6.4 and Remark 6.7, there exists δ > 0 such that SH_α(f, g, δ; Ω) is optimal for E^{1/α}(Ω; ν). We define …, where we order (A_j)_{j∈N} such that |det(A_j)| ≤ |det(A_{j+1})| for all j ∈ N. This construction implies that the α-shearlet system SH_α(f, g, δ; Ω) is an affi...
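In the excerpt above, the complexity of a network is measured by its connectivity M (the number of nonzero weights and biases) and by the number of bits needed to store the quantized nonzero entries together with their positions in the topology. The NumPy sketch below makes this bookkeeping concrete for an arbitrary toy network; the quantization rule (about log2(1/ε) bits per value plus a position index) is a simplified stand-in for the encoding schemes used in such lower-bound arguments, not the paper's exact encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse network Phi: R^d -> R with L = 3 layers (weights chosen arbitrarily).
d = 4
weights = [rng.normal(size=(16, d)), rng.normal(size=(8, 16)), rng.normal(size=(1, 8))]
biases = [rng.normal(size=16), rng.normal(size=8), rng.normal(size=1)]
for W in weights:                                  # prune: keep roughly 20% of the entries
    W[rng.random(W.shape) > 0.2] = 0.0

# Connectivity M = total number of nonzero weights and biases.
M = sum(int(np.count_nonzero(A)) for A in weights + biases)

# Simplified memory model: each nonzero entry is quantized to ~log2(1/eps) bits,
# and its position in the (sparse) topology costs log2(#possible slots) bits.
eps = 1e-3
bits_value = int(np.ceil(np.log2(1.0 / eps)))
n_slots = sum(A.size for A in weights + biases)
bits_position = int(np.ceil(np.log2(n_slots)))

total_bits = M * (bits_value + bits_position)
print(f"connectivity M = {M}, storage under this model = {total_bits} bits")
```

The point of such a count is that the total bit budget grows essentially like M times a logarithmic factor, which is what links approximation rates in terms of connectivity to information-theoretic lower bounds for the function class.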
We analyze approximation rates of deep ReLU neural networks for Sobolev-regular functions with respect to weaker Sobolev norms. First, we construct, based on a calculus of ReLU networks, artificial neural networks with ReLU activation functions that achieve certain approximation rates. Second, we establish lower bounds for the approximation by ReLU neural networks for classes of Sobolev-regular functions. Our results extend recent advances in the approximation theory of ReLU networks to the regime that is most relevant for applications in the numerical analysis of partial differential equations.

• …, which encourages the network to encode information about the derivatives of f in its weights (a minimal sketch of such a loss follows this excerpt). The authors of [16] call this method Sobolev training and report reduced generalization errors and improved data efficiency in a network compression task (see [31]) and in an application to synthetic gradients (see [34]). In the case of network compression, the approximated function f is realized by a possibly very large neural network N_large(·|w) that has been trained for some supervised learning task and is learned by a smaller network N_small. In contrast to the usual supervised learning setting, the approximated function f(·) = N_large(·|w) is known and its derivatives can be computed.

• Motivated by the performance of deep-learning-based solutions in classical machine learning tasks and, in particular, by their ability to overcome the curse of dimensionality, neural networks are now also applied to the approximate solution of partial differential equations (PDEs) (see [26, 36, 54, 59]). In [54], the authors present their deep Galerkin method for approximating solutions of high-dimensional quasilinear parabolic PDEs. For this, a functional J(f) encoding the differential operator, boundary conditions, and initial conditions is introduced. A neural network N_PDE with weights w is then trained to minimize the functional J(N_PDE(·|w)). This is done by discretizing the functional and randomly sampling spatial points.

The theoretical foundation for approximating a function together with its higher-order derivatives by a neural network was already given in a less well-known version of the universal approximation theorem by Hornik in [32, Theorem 3]. In particular, it was shown that if the activation function ϱ is k-times continuously differentiable, non-constant, and bounded, then any k-times continuously differentiable function f and its derivatives up to order k can be uniformly approximated by a shallow neural network on compact sets. Note, though, that the conditions on the activation function are very restrictive and that, for example, the ReLU is not covered by this result. However, in [16] it was shown that the theorem also holds for shallow ReLU networks if k = 1. Theorem 3 in [32] was also used in [54] to show the existence of a shallow network approximating solutions of the PDEs considered in that paper. An important aspect that is untouched by the previous approximation results is how the complexity of a network and, in particular, its depth...
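As a concrete illustration of the Sobolev-training loss mentioned above, the PyTorch sketch below fits a small ReLU network to a target whose values and first derivatives are both known, penalizing the mismatch in both. The architecture, target function, and hyperparameters are made up for the example and do not reproduce the setup of [16]; the derivative of the network is obtained with standard automatic differentiation.

```python
import torch

# Hypothetical small student network; layer sizes are illustrative only.
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

def target(x):            # function whose values and derivative are both available
    return torch.sin(3 * x)

def target_grad(x):
    return 3 * torch.cos(3 * x)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2001):
    x = torch.rand(256, 1, requires_grad=True)
    y = net(x)
    # derivative of the network output with respect to its input, via autograd
    dy = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y),
                             create_graph=True)[0]
    # Sobolev-type loss: match values AND first derivatives of the target
    loss = ((y - target(x)) ** 2).mean() + ((dy - target_grad(x)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step}: loss = {loss.item():.4f}")
```

The same pattern applies to network compression (take the large trained network as the target and differentiate it) and, with a different functional, to residual-based PDE losses of deep-Galerkin type.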
Approximation rate bounds for emulations of real-valued functions on intervals by deep neural networks (DNNs) are established. The approximation results are given for DNNs based on ReLU activation functions. The approximation error is measured with respect to Sobolev norms. It is shown that ReLU DNNs allow for essentially the same approximation rates as nonlinear, variable-order, free-knot (or so-called “hp-adaptive”) spline approximations and spectral approximations, for a wide range of Sobolev and Besov spaces. In particular, exponential convergence rates in terms of the DNN size for univariate, piecewise Gevrey functions with point singularities are established. Combined with recent results on ReLU DNN approximation of rational, oscillatory, and high-dimensional functions, this corroborates that continuous, piecewise affine ReLU DNNs afford algebraic and exponential convergence rate bounds which are comparable to “best in class” schemes for several important function classes of high and infinite smoothness. Using composition of DNNs, we also prove that radial-like functions obtained as compositions of the above with the Euclidean norm and, possibly, anisotropic affine changes of co-ordinates can be emulated at exponential rate in terms of the DNN size and depth without the curse of dimensionality.
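Exponential convergence rates of this kind rest on ReLU constructions whose accuracy improves geometrically as layers are added. A well-known building block (due to Yarotsky, and the standard route to ReLU emulation of products and polynomials) approximates x^2 on [0, 1] by repeatedly composing a ReLU "hat" function; each additional composition reduces the uniform error by a factor of 4. The NumPy check below reproduces this standard construction for illustration; it is not a result specific to the paper above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    """Triangle ('hat') function supported on [0, 1], built from three ReLU units."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

x = np.linspace(0.0, 1.0, 100001)
approx = x.copy()        # f_0(x) = x
g = x.copy()
for m in range(1, 7):
    g = hat(g)                        # g_m = hat composed m times (depth grows with m)
    approx = approx - g / 4.0 ** m    # f_m(x) = x - sum_{s <= m} g_s(x) / 4^s
    err = np.max(np.abs(approx - x ** 2))
    print(f"m = {m}: sup-error = {err:.2e}")   # shrinks by a factor of 4 per extra block
```

Since multiplication can be written via squarings, xy = ((x + y)^2 - x^2 - y^2)/2, the same mechanism yields ReLU emulation of polynomials, and ultimately the spline- and spectral-type rates discussed in the abstract.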
We consider regions of images that exhibit smooth statistics, and pose the question of characterizing the "essence" of these regions that matters for recognition. Ideally, this would be a statistic (a function of the image) that does not depend on viewpoint and illumination, and yet is sufficient for the task. In this manuscript, we show that such statistics exist. That is, one can compute deterministic functions of the image that contain all the "information" present in the original image, except for the effects of viewpoint and illumination. We also show that such statistics are supported on a "thin" (zero-measure) subset of the image domain, and thus the "information" in an image that is relevant for recognition is sparse. Yet, from this thin set one can reconstruct an image that is equivalent to the original up to a change of viewpoint and local illumination (contrast). Finally, we formalize the notion of "information" an image contains for the purpose of viewpoint- and illumination-invariant tasks, which we call "actionable information" following ideas of J. J. Gibson.
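As a toy analogue of a statistic that discards local contrast while being supported on a "thin" subset of the image domain (emphatically not the construction from the manuscript), one can restrict the gradient orientation, which is unchanged by positive affine contrast changes and, in the continuum, by any monotone contrast change, to the small set of strong-edge pixels. A NumPy sketch on a synthetic image:

```python
import numpy as np

# Synthetic image: a bright disc on a smooth intensity ramp.
h = w = 256
yy, xx = np.mgrid[0:h, 0:w]
img = 0.3 * xx / w + 1.0 * ((xx - 150) ** 2 + (yy - 100) ** 2 < 40 ** 2)

def orientation_and_magnitude(image):
    gy, gx = np.gradient(image.astype(float))
    return np.arctan2(gy, gx), np.hypot(gx, gy)

theta1, mag = orientation_and_magnitude(img)
theta2, _ = orientation_and_magnitude(2.0 * img + 0.1)   # positive affine contrast change

mask = mag > np.quantile(mag, 0.98)     # "thin" support: the strongest ~2% of pixels
print("fraction of pixels retained:", mask.mean())
print("orientation unchanged on that set:", np.allclose(theta1[mask], theta2[mask]))
```

This only illustrates the flavor of a contrast-insensitive statistic on a sparse support; the manuscript's notion of actionable information and its reconstruction result are considerably stronger.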