We study how neural networks compress uninformative input space in models where data lie in d dimensions but whose labels vary only within a linear manifold of dimension d_∥ < d. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the feature learning regime) and trained with gradient descent, the first layer of weights evolves to become nearly insensitive to the d_⊥ = d − d_∥ uninformative directions. These are effectively compressed by a factor λ ∼ p, where p is the size of the training set. We quantify the benefit of such compression on the test error ε. For large initialization of the weights (the lazy training regime), no compression occurs and, for regular boundaries separating labels, we find ε ∼ p^{−β} with β_Lazy = d/(3d − 2). Compression improves the learning curves, so that β_Feature = (2d − 1)/(3d − 2) if d_∥ = 1 and β_Feature = (d + d_⊥/2)/(3d − 2) if d_∥ > 1. We test these predictions for a stripe model where boundaries are parallel interfaces (d_∥ = 1) as well as for a cylindrical boundary (d_∥ = 2). Next, we show that compression shapes the evolution of the neural tangent kernel (NTK) during training, so that its top eigenvectors become more informative and display a larger projection on the labels. Consequently, kernel learning with the NTK frozen at the end of training outperforms learning with the initial NTK. We confirm these predictions both for a one-hidden-layer fully connected network trained on the stripe model and for a 16-layer convolutional neural network trained on the MNIST data set, for which we also find β_Feature > β_Lazy. The strong similarities found in these two cases support the idea that compression is central to the training of MNIST, and put forward kernel principal component analysis on the evolving NTK as a useful diagnostic of compression in deep networks.
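As an illustration of the diagnostic suggested in the last sentence, the snippet below sketches how the alignment between the labels and the leading eigenvectors of the empirical NTK Gram matrix could be measured on a toy stripe-model data set. This is a minimal sketch, not the authors' code: the one-hidden-layer architecture, the data sizes, and the choice of ten leading eigenvectors are illustrative assumptions, and plain PyTorch autograd is used to build the per-sample parameter gradients that define the NTK.

```python
# Minimal sketch (not the authors' code) of the kernel-PCA diagnostic: project the labels
# onto the leading eigenvectors of the empirical NTK Gram matrix. Architecture, data sizes
# and the number of retained eigenvectors are illustrative assumptions.
import torch

d, p, h = 10, 200, 512                       # input dimension, training-set size, hidden width
torch.manual_seed(0)

x = torch.randn(p, d)                        # Gaussian inputs in d dimensions
y = torch.sign(x[:, 0])                      # stripe model: labels vary along one direction only

model = torch.nn.Sequential(
    torch.nn.Linear(d, h), torch.nn.ReLU(), torch.nn.Linear(h, 1)
)

def ntk_gram(model, x):
    """Empirical NTK: K_ij = <grad_theta f(x_i), grad_theta f(x_j)>."""
    grads = []
    for i in range(x.shape[0]):
        model.zero_grad()
        model(x[i:i + 1]).squeeze().backward()
        grads.append(torch.cat([q.grad.flatten() for q in model.parameters()]))
    g = torch.stack(grads)
    return g @ g.T

K = ntk_gram(model, x)
evals, evecs = torch.linalg.eigh(K)          # eigenvalues in ascending order
top = evecs[:, -10:]                         # ten leading NTK eigenvectors
alignment = (top.T @ y).pow(2).sum() / y.pow(2).sum()
print(f"fraction of label norm captured by top-10 NTK modes: {alignment.item():.3f}")
# Recomputing the Gram matrix with the NTK frozen at the end of training should give a
# larger alignment if the first layer has compressed the d - 1 uninformative directions.
```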
Understanding why deep nets can classify data in large dimensions remains a challenge. It has been proposed that they do so by becoming stable to diffeomorphisms, yet existing empirical measurements suggest that this is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, which allows us to study typical diffeomorphisms of a given norm. We confirm that stability toward diffeomorphisms does not correlate strongly with performance on benchmark data sets of images. By contrast, we find that the stability toward diffeomorphisms relative to that of generic transformations, R_f, correlates remarkably well with the test error ε_t. It is of order unity at initialization but decreases by several orders of magnitude during training for state-of-the-art architectures. For CIFAR10 and 15 known architectures we find ε_t ≈ 0.2 R_f, suggesting that obtaining a small R_f is important to achieve good performance. We study how R_f depends on the size of the training set and compare it to a simple model of invariant learning.
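To make the quantity R_f concrete, here is a rough sketch of how the relative stability of an image classifier could be estimated. It is not the paper's implementation: a single random low-frequency warp applied with grid_sample stands in for the maximum-entropy diffeomorphisms defined in the paper, the function names (smooth_warp, relative_stability) and the exact normalization are assumptions, and only the matched-norm comparison between a smooth deformation and isotropic noise follows the abstract's definition.

```python
# Rough sketch of a relative-stability estimate R_f for an image classifier `model`.
# Assumption: one random low-frequency warp built with grid_sample stands in for the
# maximum-entropy diffeomorphisms of the paper; only the matched-norm comparison between
# a smooth deformation and isotropic noise follows the abstract's definition.
import math
import torch
import torch.nn.functional as F

def smooth_warp(x, amplitude=2.0):
    """Displace pixels with one random low-frequency sinusoidal field (crude diffeo proxy)."""
    b, c, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=x.device),
        torch.linspace(-1, 1, w, device=x.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
    phase = 2 * math.pi * torch.rand(b, 1, 1, 2, device=x.device)
    grid = grid + (amplitude / h) * torch.sin(math.pi * grid.flip(-1) + phase)
    return F.grid_sample(x, grid, align_corners=True)

def relative_stability(model, x):
    """R_f ~ E||f(tau x) - f(x)||^2 / E||f(x + eta) - f(x)||^2, with ||eta|| matched to ||tau x - x||."""
    with torch.no_grad():
        x_diff = smooth_warp(x)
        eta = torch.randn_like(x)
        scale = (x_diff - x).flatten(1).norm(dim=1) / eta.flatten(1).norm(dim=1)
        eta = eta * scale[:, None, None, None]
        f0, f_diff, f_noise = model(x), model(x_diff), model(x + eta)
        return ((f_diff - f0).pow(2).sum(1).mean() / (f_noise - f0).pow(2).sum(1).mean()).item()
```

Averaging this ratio over test images at initialization and again after training would then show whether the drop by several orders of magnitude reported above occurs for a given architecture.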
It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle: feature learning can perform worse than lazy training (via the random feature kernel or the NTK) because the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark data sets of images. For (i), we compute the scaling of the generalization error with the number of training points and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for the deterioration in performance, which is known to correlate with smoothness along diffeomorphisms.
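For setting (i), the learning-curve exponent of a lazy method can be estimated numerically along the following lines. This is a hedged sketch rather than the paper's protocol: the target is sampled as a wide random ReLU teacher (one draw of the associated dot-product Gaussian process), ridge regression on frozen random ReLU features stands in for lazy training, and all sizes and the ridge value are arbitrary choices; comparing against a network trained in the feature-learning regime would require an additional training loop not shown here.

```python
# Hedged sketch for setting (i): estimate the learning-curve exponent of a lazy predictor
# (ridge regression on frozen random ReLU features) for a Gaussian random target on the
# unit sphere. The target is sampled as a wide random ReLU teacher (one draw of the
# associated dot-product Gaussian process); sizes and the ridge value are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d, n_feat, n_teacher, ridge = 5, 4000, 2000, 1e-6

def sphere(n):                                # uniform points on the unit sphere in d dimensions
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

V = rng.standard_normal((n_teacher, d))       # random teacher = one Gaussian random function
c = rng.standard_normal(n_teacher) / np.sqrt(n_teacher)
target = lambda x: np.maximum(x @ V.T, 0.0) @ c

W = rng.standard_normal((n_feat, d))          # frozen random features (lazy predictor)
features = lambda x: np.maximum(x @ W.T, 0.0) / np.sqrt(n_feat)

x_test = sphere(2000)
y_test = target(x_test)

ps, errs = [64, 128, 256, 512, 1024, 2048], []
for p in ps:
    x_tr = sphere(p)
    phi = features(x_tr)
    alpha = np.linalg.solve(phi @ phi.T + ridge * np.eye(p), target(x_tr))
    pred = features(x_test) @ phi.T @ alpha   # kernel ridge regression in dual form
    errs.append(np.mean((pred - y_test) ** 2))

beta = -np.polyfit(np.log(ps), np.log(errs), 1)[0]
print("estimated learning-curve exponent beta ≈", round(beta, 2))
```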