We consider the random-design least-squares regression problem within the reproducing kernel Hilbert space (RKHS) framework. Given a stream of independent and identically distributed input/output data, we aim to learn a regression function within an RKHS H, even if the optimal predictor (i.e., the conditional expectation) is not in H. In a stochastic approximation framework where the estimator is updated after each observation, we show that the averaged unregularized least-mean-square algorithm (a form of stochastic gradient descent), given a sufficiently large step-size, attains optimal rates of convergence for a variety of regimes of smoothness of the optimal prediction function and of the functions in H.
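For concreteness, a minimal sketch of this stochastic-approximation recursion, under assumptions made here purely for illustration (a constant step-size $\gamma$, initialization $g_0 = 0$, and the notation $K_x = K(x, \cdot)$ for the reproducing-kernel feature map), is

$$ g_n \;=\; g_{n-1} \;-\; \gamma \,\bigl( g_{n-1}(x_n) - y_n \bigr)\, K_{x_n}, \qquad \bar g_n \;=\; \frac{1}{n+1} \sum_{k=0}^{n} g_k, $$

where $(x_n, y_n)$ is the observation processed at step $n$; the averaged iterate $\bar g_n$ is the estimator whose convergence rates are referred to above.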
Introduction. Positive-definite-kernel-based methods such as the support vector machine or kernel ridge regression are now widely used in many areas of science and engineering. They were first developed within the statistics community for non-parametric regression using splines, Sobolev spaces, and more generally reproducing kernel Hilbert spaces (see, e.g., [1]). Within the machine learning community, they were extended in several interesting ways (see, e.g., [2,3]): (a) problems beyond regression were tackled using positive-definite kernels, through the "kernelization" of classical unsupervised learning methods such as principal component analysis, canonical correlation analysis, or K-means; (b) efficient algorithms based on convex optimization have emerged, in particular for large sample sizes; and (c) kernels for non-vectorial data have been designed for objects like strings, graphs, measures, etc. A key feature is that they allow the separation of the representation problem (designing good kernels for non-vectorial data) from the algorithmic/theoretical problems (given a kernel, how to design, run efficiently, and analyse estimation algorithms).

The theoretical analysis of non-parametric least-squares regression within the RKHS framework is well understood. In particular, regression on input data in $\mathbb{R}^d$, $d \geq 1$, with so-called Mercer kernels (continuous kernels over a compact set), which lead to dense subspaces of the space of square-integrable functions, has been widely studied.
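As an illustration of the algorithmic side (given a kernel, run an estimation algorithm online), here is a small sketch of one pass of unregularized kernel least-mean-squares with iterate averaging; the Gaussian kernel, the constant step-size gamma, and all function names are assumptions made for this example, not the paper's implementation.

import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # Assumed kernel choice for the sketch; any positive-definite kernel works here.
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * bandwidth ** 2))

def averaged_kernel_lms(stream, gamma=0.5, kernel=gaussian_kernel):
    # One pass of unregularized kernel least-mean-squares with iterate averaging.
    # `stream` yields i.i.d. (x, y) pairs; each iterate is kept as a kernel
    # expansion over the inputs seen so far.  Constant step-size `gamma` is an assumption.
    xs, coefs, avg_coefs = [], [], []
    for n, (x, y) in enumerate(stream, start=1):
        # Prediction of the current (non-averaged) iterate at the new input x.
        pred = sum(c * kernel(xi, x) for xi, c in zip(xs, coefs))
        # Stochastic-gradient / LMS step: subtract gamma * (g(x) - y) * K(x, .).
        xs.append(x)
        coefs.append(-gamma * (pred - y))
        # Running average of the iterates g_1, ..., g_n, coefficient-wise.
        avg_coefs = [a + (c - a) / n for a, c in zip(avg_coefs + [0.0], coefs)]

    def predict(x_new):
        # The averaged iterate serves as the final predictor.
        return sum(a * kernel(xi, x_new) for xi, a in zip(xs, avg_coefs))

    return predict

Storing each iterate as a kernel expansion over the observed inputs is the standard representation for such online kernel methods; a single pass then costs O(n) kernel evaluations per update, O(n^2) overall.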