A continuing mystery in understanding the empirical success of deep neural networks is their ability to achieve zero training error and generalize well, even when the training data is noisy and there are more parameters than data points. We investigate this overparameterized regime in linear regression, where all solutions that minimize training error interpolate the data, including noise. We characterize the fundamental generalization (mean-squared) error of any interpolating solution in the presence of noise, and show that this error decays to zero with the number of features. Thus, overparameterization can be explicitly beneficial in ensuring harmless interpolation of noise. We discuss two root causes for poor generalization that are complementary in nature: signal "bleeding" into a large number of alias features, and overfitting of noise by parsimonious feature selectors. For the sparse linear model with noise, we provide a hybrid interpolating scheme that mitigates both these issues and achieves order-optimal MSE over all possible interpolating solutions.

2. We provide a Fourier-theoretic interpretation of concurrent analyses [6-10] of the minimum ℓ2-norm interpolator.
3. We show (Theorem 2) that parsimonious interpolators (like the ℓ1-minimizing interpolator and its relatives) suffer the complementary problem of overfitting pure noise.
4. We construct two-step hybrid interpolators that successfully recover signal and harmlessly fit noise, achieving the order-optimal rate of test MSE among all interpolators (Proposition 1 and all its corollaries).
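As a rough numerical illustration of the two interpolators contrasted above, the sketch below fits both the minimum ℓ2-norm interpolator and the ℓ1-minimizing interpolator (basis pursuit, written as a linear program) to a noisy k-sparse linear model with more features than samples. The dimensions, sparsity level, and noise level are arbitrary assumed values, and this is not the paper's own experimental setup.

```python
# Sketch: min-l2-norm vs. l1-minimizing interpolators in an overparameterized
# sparse linear model y = X w* + noise, with d >> n. Assumed toy dimensions.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d, k, noise_std = 50, 300, 5, 0.5           # samples, features, sparsity, noise (assumed)

w_star = np.zeros(d)
w_star[:k] = 1.0                                # k-sparse true signal
X = rng.standard_normal((n, d))
y = X @ w_star + noise_std * rng.standard_normal(n)

# Minimum l2-norm interpolator: w = X^T (X X^T)^{-1} y.
w_l2 = X.T @ np.linalg.solve(X @ X.T, y)

# l1-minimizing interpolator (basis pursuit) as an LP over variables [w; u]:
#   minimize sum(u)  subject to  -u <= w <= u,  X w = y.
c = np.concatenate([np.zeros(d), np.ones(d)])
A_ub = np.block([[np.eye(d), -np.eye(d)], [-np.eye(d), -np.eye(d)]])
b_ub = np.zeros(2 * d)
A_eq = np.hstack([X, np.zeros((n, d))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * (2 * d))
w_l1 = res.x[:d]

for name, w in [("min-l2", w_l2), ("min-l1", w_l1)]:
    train_mse = np.mean((X @ w - y) ** 2)       # ~0: both solutions interpolate the data
    excess_mse = np.sum((w - w_star) ** 2)      # equals excess test MSE for isotropic Gaussian x
    print(f"{name}: train MSE {train_mse:.2e}, excess test MSE {excess_mse:.4f}")
```

Both solvers return exact interpolators (training MSE numerically zero); how their excess test MSE scales with n, d, k and the noise level is the subject of the results summarized above.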
Related work

We discuss prior work in three categories: a) overparameterization in deep neural networks, b) interpolation of high-dimensional data using kernels, and c) high-dimensional linear regression. We then recap work on overparameterized linear regression that is concurrent to ours.
Recent interest in overparameterization

Conventional statistical wisdom is that using more parameters in one's model than data points leads to poor generalization. This wisdom is corroborated in theory by worst-case generalization bounds on such overparameterized models following from VC-theory in classification [2] and ill-conditioning in least-squares regression [5]. It is, however, contradicted in practice by the notable recent trend of empirically successful overparameterized deep neural networks. For example, the commonly used CIFAR-10 dataset contains 60,000 images, but the number of parameters in all the neural networks achieving state-of-the-art performance on CIFAR-10 is at least 1.5 million [4]. These neural networks have the ability to memorize pure noise; somehow, they are still able to generalize well when trained with meaningful data. Since the publication of this observation [4, 11], the machine learning community has seen a flurry of activity to attempt to explain this phenomenon, both for classification and regression problems, in neural networks. The problem is challenging for three core reasons:

1. The optimization landscape for l...