We study the necessary and sufficient complexity of ReLU neural networks - in terms of depth and number of weights - which is required for approximating classifier functions in an L^2-sense. As a model class, we consider the set E^β(ℝ^d) of possibly discontinuous piecewise C^β functions f:[-1/2,1/2]^d→ℝ, where the different "smooth regions" of f are separated by C^β hypersurfaces. For given dimension d≥2, regularity β>0, and accuracy ε>0, we construct artificial neural networks with ReLU activation function that approximate functions from E^β(ℝ^d) up to an L^2 error of ε. The constructed networks have a fixed number of layers, depending only on d and β, and they have O(ε^{-2(d-1)/β}) many nonzero weights, which we prove to be optimal. For the proof of optimality, we establish a lower bound on the description complexity of the class E^β(ℝ^d). By showing that a family of approximating neural networks gives rise to an encoder for E^β(ℝ^d), we then prove that one cannot approximate a general function f∈E^β(ℝ^d) using neural networks that are less complex than those produced by our construction. In addition to the optimality in terms of the number of weights, we show that in order to achieve this optimal approximation rate, one needs ReLU networks of a certain minimal depth. Precisely, for piecewise C^β(ℝ^d) functions, this minimal depth is given - up to a multiplicative constant - by β∕d. Up to a log factor, our constructed networks match this bound. This partly explains the benefits of depth for ReLU networks by showing that deep networks are necessary to achieve efficient approximation of (piecewise) smooth functions. Finally, we analyze approximation in high-dimensional spaces where the function f to be approximated can be factorized into a smooth dimension-reducing feature map τ and a classifier function g - defined on a low-dimensional feature space - as f=g∘τ. We show that in this case the approximation rate depends only on the dimension of the feature space and not on the input dimension.
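As a rough illustration of how the stated bounds scale, the following Python sketch evaluates the weight budget ε^{-2(d-1)/β} and the depth scale β∕d for a few hypothetical choices of d, β, and ε. This is my own illustration, not code from the paper; the multiplicative constants hidden in the O(·) and in the depth bound are omitted, so the numbers only indicate scaling behavior.

```python
# Illustrative sketch (not from the paper): evaluate the complexity bounds
# quoted in the abstract for hypothetical values of d, beta, and eps.
# Weight count scales as eps^(-2(d-1)/beta); minimal depth scales as beta/d,
# in both cases up to multiplicative constants that are omitted here.

def weight_budget(d: int, beta: float, eps: float) -> float:
    """Order of nonzero weights needed: eps**(-2*(d-1)/beta), constants omitted."""
    return eps ** (-2 * (d - 1) / beta)

def minimal_depth_scale(d: int, beta: float) -> float:
    """Depth scale (up to a multiplicative constant) needed for the optimal rate."""
    return beta / d

if __name__ == "__main__":
    for d, beta, eps in [(2, 1.0, 1e-2), (2, 2.0, 1e-2), (5, 2.0, 1e-2)]:
        print(f"d={d}, beta={beta}, eps={eps}: "
              f"weights ~ {weight_budget(d, beta, eps):.1e}, "
              f"depth ~ {minimal_depth_scale(d, beta):.2f}")
```

For fixed accuracy, smoother target classes (larger β) reduce the required number of weights but increase the depth scale, which is exactly the trade-off the abstract describes.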
In this paper we show that the Fourier transform induces an isomorphism between the coorbit spaces, defined by Feichtinger and Gröchenig, of the mixed, weighted Lebesgue spaces L^{p,q}_v with respect to the quasi-regular representation of a semi-direct product ℝ^d ⋊ H with suitably chosen dilation group H, and certain decomposition spaces D(Q, L^p, ℓ^q_u) (essentially as introduced by Feichtinger and Gröbner), where the localized "parts" of a function are measured in the FL^p-norm. This equivalence is useful in several ways: it provides access to a Fourier-analytic understanding of wavelet coorbit spaces, and it allows one to discuss coorbit spaces associated to different dilation groups in a common framework. As an illustration of these points, we include a short discussion of dilation invariance properties of coorbit spaces associated to different types of dilation groups.
We analyze the topological properties of the set of functions that can be implemented by neural networks of a fixed size. Surprisingly, this set has many undesirable properties. It is highly non-convex, except possibly for a few exotic activation functions. Moreover, the set is not closed with respect to $L^p$-norms, $0 < p < \infty$, for all practically used activation functions, and also not closed with respect to the $L^\infty$-norm for all practically used activation functions except for the ReLU and the parametric ReLU. Finally, the function that maps a family of weights to the function computed by the associated network is not inverse stable for every practically used activation function. In other words, if $f_1, f_2$ are two functions realized by neural networks and if $f_1, f_2$ are close in the sense that $\Vert f_1 - f_2 \Vert_{L^\infty} \le \varepsilon$ for $\varepsilon > 0$, it is, regardless of the size of $\varepsilon$, usually not possible to find weights $w_1, w_2$ close together such that each $f_i$ is realized by a neural network with weights $w_i$. Overall, our findings identify potential causes for issues in the training procedure of deep learning, such as no guaranteed convergence, explosion of parameters, and slow convergence.
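To make the non-closedness claim concrete, here is a small numerical sketch (my own illustration, not taken from the paper): the one-hidden-layer ReLU networks f_n(x) = n·(ReLU(x + 1/n) − ReLU(x)) converge in $L^2([-1,1])$ to the indicator function of $[0, \infty)$, which is discontinuous and hence not realizable by any ReLU network, while the outer weight n diverges. This mirrors the link the abstract draws between non-closedness and explosion of parameters.

```python
# Minimal numerical sketch (illustration only): a sequence of small ReLU networks
# whose realized functions converge in L^2 to a discontinuous limit, while the
# network weights blow up.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def f_n(x, n):
    # One-hidden-layer ReLU network with two neurons and outer weights +n, -n.
    return n * (relu(x + 1.0 / n) - relu(x))

x = np.linspace(-1.0, 1.0, 200_001)
target = (x >= 0).astype(float)  # indicator of [0, infinity), not a ReLU network

for n in [10, 100, 1000]:
    # Approximate the L^2([-1,1]) error by a Riemann sum (interval length 2).
    err = np.sqrt(np.mean((f_n(x, n) - target) ** 2) * 2.0)
    print(f"n = {n:5d}: outer weight = {n}, L2 error ~ {err:.4f}")
```

The printed errors shrink like $1/\sqrt{3n}$ even though the weights grow like $n$, so any sequence of networks tracking this limit must have unbounded parameters.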
This paper provides maximal function characterizations of anisotropic Triebel-Lizorkin spaces associated to general expansive matrices for the full range of parameters p ∈ (0, ∞), q ∈ (0, ∞] and α ∈ R. The equivalent norm is defined in terms of the decay of wavelet coefficients, quantified by a Peetre-type space over a one-parameter dilation group. For the Banach space regime p, q ≥ 1, we use this characterization to prove the existence of frames and Riesz sequences of dual molecules for the Triebel-Lizorkin spaces; the atoms are obtained by translations and anisotropic dilations of a single function, where neither the translation nor dilation parameters are required to belong to a discrete subgroup. Explicit criteria for molecules are given in terms of smoothness, decay and moment conditions.