Universal supervised learning is considered from an information-theoretic point of view, following the universal prediction approach; see Merhav and Feder (1998). We consider standard supervised "batch" learning, where prediction is made on a test sample once the entire training data has been observed, in the individual setting, where the features and labels, in both the training and test data, are specific individual quantities. The information-theoretic approach naturally uses the self-information loss, or log-loss. Our results provide universal learning schemes that compete with a "genie" (or reference) that knows the true test label. In particular, we demonstrate that the main proposed scheme, termed Predictive Normalized Maximum Likelihood (pNML), is a robust learning solution that outperforms the current leading approach based on Empirical Risk Minimization (ERM). Furthermore, the pNML construction provides a pointwise indication of the learnability of the specific test challenge given the training examples.
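To make the pNML idea concrete, the following is a minimal sketch: for each hypothesized test label, refit the model by maximum likelihood on the training set augmented with that label, evaluate the probability the refitted model assigns to that label, and normalize over labels. The log of the normalizer is the pointwise regret, which serves as the learnability indication mentioned above. The one-parameter logistic model class and the grid-search maximum-likelihood fit are illustrative choices of ours, not constructions from the paper.

```python
import numpy as np

def log_likelihood(theta, xs, ys):
    # Bernoulli log-likelihood under p_theta(y=1|x) = sigmoid(theta * x)
    p = 1.0 / (1.0 + np.exp(-theta * xs))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(ys * np.log(p) + (1 - ys) * np.log(1 - p))

def pnml(x_test, xs, ys, grid=np.linspace(-10, 10, 2001)):
    """pNML prediction for a binary test label, via grid-search ML.

    Returns ([q(y=0|x), q(y=1|x)], pointwise regret log(normalizer)).
    """
    probs = []
    for y in (0, 1):
        # Refit with the hypothesized test label appended to the training set.
        xs_aug = np.append(xs, x_test)
        ys_aug = np.append(ys, y)
        lls = [log_likelihood(t, xs_aug, ys_aug) for t in grid]
        theta_hat = grid[int(np.argmax(lls))]
        p1 = 1.0 / (1.0 + np.exp(-theta_hat * x_test))
        probs.append(p1 if y == 1 else 1.0 - p1)
    norm = sum(probs)  # always >= 1; larger means the test point is harder
    return [p / norm for p in probs], float(np.log(norm))
```

Because each label gets the benefit of its own refitted model, the unnormalized probabilities sum to more than one; a normalizer close to 1 means the training data essentially determines the test label within the class, while a large normalizer flags a test point the class cannot be trusted on.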
The Model Class

The model class definition plays an important role in all the settings we consider. Specifically, a model class is a set of conditional probability distributions

P_Θ = {p_θ(y|x), θ ∈ Θ},

where Θ is a general index set. Equivalently, there is a set of stochastic functions {y = g_θ(x), θ ∈ Θ} used to explain the relation between x and y. A major issue is how to choose the model class. As common sense indicates, on one hand one may wish to choose as large a class as possible, so that any possible relation between x and y can be captured by some member of the class. However, if the class is too large, it may not be "learnable": it becomes impossible to draw reliable conclusions about such a large class from a finite training set of size N. This notion appears in classical statistical reasoning and is expressed, e.g., in the bias-variance trade-off. The choice of the model class is discussed briefly towards the end of the paper; throughout, we assume that P_Θ is given.
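The equivalence between the two views of a model class can be illustrated in a few lines. The logistic class below is our own illustrative choice (any parametric family of conditionals would do): p_θ(y|x) gives the conditional-distribution view, and g_θ(x) gives the stochastic-function view, drawing a label from that same conditional.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_theta(y, x, theta):
    """One member of the model class: the conditional probability
    p_theta(y | x) for binary y, under an illustrative logistic model."""
    p1 = 1.0 / (1.0 + np.exp(-theta * x))
    return p1 if y == 1 else 1.0 - p1

def g_theta(x, theta, rng=rng):
    """Equivalent stochastic-function view: y = g_theta(x) samples a
    label from the same conditional distribution p_theta(. | x)."""
    return int(rng.random() < 1.0 / (1.0 + np.exp(-theta * x)))
```

Here the index set Θ is the real line; a richer class (e.g., all neural networks of a given architecture) simply uses a larger index set, which is exactly where the learnability trade-off discussed above enters.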