Linear regression is a classical paradigm in statistics; here it is given a new look through the lens of universal learning. In applying universal learning to linear regression, the hypothesis class represents the label y ∈ R as a linear combination of the feature vector, y = x^T θ with x, θ ∈ R^M, up to a Gaussian error. The Predictive Normalized Maximum Likelihood (pNML) solution for universal learning of individual data can be expressed analytically in this case, as can its associated learnability measure. Interestingly, the framework allows examining the situation where the number of parameters M is even larger than the number of training samples N. As expected, learnability cannot then be attained in every situation; nevertheless, if the test vector resides mostly in the subspace spanned by the eigenvectors associated with the large eigenvalues of the empirical correlation matrix of the training data, linear regression can generalize despite using an "over-parametrized" model. We demonstrate the results with a simulation of fitting a polynomial of possibly large degree to data.
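For concreteness, the analytic form can be sketched in the regular case N > M with an invertible empirical correlation matrix X^T X, where X is the N × M matrix of training features (this is our own reconstruction under those assumptions, not a quotation of the result; the over-parametrized case replaces the inverse by a pseudo-inverse and requires a more careful treatment). Appending a candidate test pair (x, y) to the training set, re-maximizing the Gaussian likelihood, and normalizing over y, a Sherman-Morrison computation shows that the pNML prediction is a Gaussian centered at the ERM prediction x^T θ̂, with standard deviation inflated by the factor 1 + x^T (X^T X)^(-1) x, so that the associated log-regret, serving as the learnability measure, is

Γ(x) = log(1 + x^T (X^T X)^(-1) x).

This quantity is small exactly when x lies mostly along directions in which the training data is rich, i.e., eigenvectors of X^T X with large eigenvalues, which is the geometric picture behind the generalization claim above.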
Universal supervised learning is considered from an information-theoretic point of view, following the universal prediction approach; see Merhav and Feder (1998). We consider the standard supervised "batch" setting, where prediction of a test sample is made once the entire training data is observed, and the individual setting, where the features and labels, in both the training and test data, are specific individual quantities. The information-theoretic approach naturally uses the self-information loss, or log-loss. Our results provide universal learning schemes that compete with a "genie" (or reference) that knows the true test label. In particular, it is demonstrated that the main proposed scheme, termed Predictive Normalized Maximum Likelihood (pNML), is a robust learning solution that outperforms the current leading approach based on Empirical Risk Minimization (ERM). Furthermore, the pNML construction provides a pointwise indication of the learnability of the specific test challenge given the training examples.

The Model Class

The model class definition plays an important role in all the settings we consider. Specifically, a model class is a set of conditional probability distributions

P_Θ = {p_θ(y|x), θ ∈ Θ},

where Θ is a general index set. This is equivalent to saying that there is a set of stochastic functions {y = g_θ(x), θ ∈ Θ} used to explain the relation between x and y.

A major issue is how to choose a model class. As common sense indicates, on one hand one may wish to choose as large a class as possible, so that any possible relation between x and y can be captured by some member of the class. However, if the class is too large, it may not be "learnable": it becomes impossible to reliably single out an appropriate member of the class from a finite training set of size N. This notion appears in classical statistical reasoning and is expressed, e.g., as the bias-variance trade-off. This major issue of choosing the model class is discussed briefly towards the end of the paper; throughout, we assume that P_Θ is given.
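To illustrate the polynomial-fitting simulation mentioned in the abstract, the following is a minimal Python sketch, not the paper's actual experiment: the sine ground truth, the noise level, and the use of x^T (X^T X)^+ x (with the pseudo-inverse standing in for the inverse) as a per-point learnability proxy are all our assumptions. It fits a polynomial of degree M-1 with M > N and evaluates the proxy at several test points:

import numpy as np

# Sketch of an over-parametrized polynomial fit (M > N) with a pNML-style
# per-point learnability proxy; ground truth and noise level are assumptions.
rng = np.random.default_rng(0)
N, M = 10, 20                          # training size, number of parameters
t_train = rng.uniform(-1.0, 1.0, N)
y_train = np.sin(np.pi * t_train) + 0.1 * rng.standard_normal(N)

def features(t, M):
    # Polynomial feature vector (1, t, t^2, ..., t^(M-1)).
    return np.vander(np.atleast_1d(t), M, increasing=True)

X = features(t_train, M)               # N x M design matrix
theta = np.linalg.pinv(X) @ y_train    # minimum-norm least-squares (ERM) fit
corr_pinv = np.linalg.pinv(X.T @ X)    # pseudo-inverse of the empirical correlation matrix

def gamma(t):
    # Learnability proxy: log(1 + x^T (X^T X)^+ x).
    x = features(t, M).ravel()
    return np.log1p(x @ corr_pinv @ x)

for t in (0.0, 0.5, 0.9, 1.5):
    pred = features(t, M).ravel() @ theta
    print(f"t = {t:+.1f}   prediction = {pred:+8.3f}   Gamma = {gamma(t):8.3f}")

Test points inside the training interval typically give a small proxy value, while points far outside it align with poorly-sampled directions and drive the proxy up. Note that this proxy only sees the component of x inside the row space of X; a component in the null space, which the pseudo-inverse annihilates, corresponds to the non-learnable situation described in the abstract.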