This paper is concerned with the statistical analysis of data sets whose elements are random histograms. For the purpose of learning principal modes of variation from such data, we consider the problem of computing the PCA of histograms with respect to the 2-Wasserstein distance between probability measures. To this end, we compare the methods of log-PCA and geodesic PCA in the Wasserstein space, as introduced in [BGKL15, SC15]. Geodesic PCA involves solving a non-convex optimization problem; to solve it approximately, we propose a novel forward-backward algorithm. This allows a detailed comparison between log-PCA and geodesic PCA of one-dimensional histograms, which we carry out on various datasets, stressing the benefits and drawbacks of each method. We extend these results to two-dimensional data and compare both methods in that setting.
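For readers unfamiliar with forward-backward splitting, the scheme alternates a gradient step on the smooth part of an objective with a proximal step on the non-smooth part. Below is a minimal generic sketch, not the specific geodesic-PCA objective from the paper; `grad_f`, `prox_g`, and the toy LASSO usage are illustrative assumptions:

```python
import numpy as np

def forward_backward(x0, grad_f, prox_g, step, n_iter=500, tol=1e-8):
    """Generic forward-backward splitting for min_x f(x) + g(x),
    with f smooth (gradient grad_f) and g handled via its proximal map."""
    x = x0
    for _ in range(n_iter):
        y = x - step * grad_f(x)   # forward (gradient) step on f
        x_new = prox_g(y, step)    # backward (proximal) step on g
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy usage (hypothetical): f(x) = 0.5*||Ax - b||^2, g(x) = lam*||x||_1.
A = np.random.randn(20, 10)
b = np.random.randn(20)
lam = 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda y, t: np.sign(y) * np.maximum(np.abs(y) - t * lam, 0.0)
step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L the gradient's Lipschitz constant
x_hat = forward_backward(np.zeros(10), grad_f, prox_g, step)
```

For the non-convex geodesic PCA problem, such an iteration only guarantees convergence to a stationary point, which is why the paper describes the algorithm as solving the problem approximately.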
In this paper, a regularization of Wasserstein barycenters for random measures supported on R^d is introduced via convex penalization. The existence and uniqueness of such barycenters is first proved for a large class of penalization functions. The Bregman divergence associated to the penalization term is then considered to obtain a stability result on penalized barycenters. This allows the comparison of data made of n absolutely continuous probability measures, within the more realistic setting where one only has access to a dataset of random variables sampled from unknown distributions. The convergence of the penalized empirical barycenter of a set of n iid random probability measures towards its population counterpart is finally analyzed. This approach is shown to be appropriate for the statistical analysis of either discrete or absolutely continuous random measures. It also allows the construction, from a set of discrete measures, of consistent estimators of population Wasserstein barycenters that are absolutely continuous.

Statistical inference using optimal transport. The penalized barycenter problem is motivated by the nonparametric method introduced in [BFS12] for the classical problem of density estimation from discrete samples. It is based on a variational regularization approach involving the Wasserstein distance as a data-fidelity term. However, the adaptation of this work to the penalization of Wasserstein barycenters had not been considered so far.

Consistent estimators of population Wasserstein barycenters. Tools from optimal transport are used in [PZ16] for the registration of multiple point processes representing repeated observations organized in samples from independent subjects or experimental units. The authors of [PZ16] proposed a consistent estimator of the population Wasserstein barycenter of multiple point processes in the case d = 1, and an extension of their methodology to d ≥ 2 is considered in [PZ17]. Their method consists of two steps: a kernel smoothing is first applied to the data, which yields a set of a.c. measures from which an empirical Wasserstein barycenter is computed in a second step. Our approach thus differs from [PZ16, PZ17] in that we directly include the smoothing step in the computation of a Wasserstein barycenter via the penalty function E in (1.3). Note also that the estimators of the population Wasserstein barycenter in [PZ16, PZ17] are shown to be consistent for the Wasserstein metric W_2, whereas we prove the consistency of our approach for metrics on the space of pdfs supported on R^d. Finally, rates of convergence (for the Wasserstein metric W_2) of empirical Wasserstein barycenters computed from discrete measures, supported on the real line only, are discussed in [PZ16, BGKL18].

Generalized notions of Wasserstein barycenters. A detailed characterization of empirical Wasserstein barycenters in terms of existence, uniqueness and regularity for probability measures with support in R^d is given in the seminal paper [AC11]. The relation of such barycenters wi...
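For reference, the penalized barycenter problem that the penalty function E in (1.3) enters into plausibly takes the following form; this is a reconstruction from the abstract (with ν_1, ..., ν_n the observed measures and γ > 0 a regularization parameter), and the exact formulation is in the paper:

\min_{\mu \in \mathcal{P}_2(\mathbb{R}^d)} \; \frac{1}{n} \sum_{i=1}^{n} W_2^2(\mu, \nu_i) \; + \; \gamma \, E(\mu),

where E is a convex penalty (e.g., a negative entropy) that is finite only on absolutely continuous measures. This is what makes the resulting barycenter estimators absolutely continuous even when the input measures ν_i are discrete.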
The notion of entropy-regularized optimal transport, also known as the Sinkhorn divergence, has recently gained popularity in machine learning and statistics, as it makes feasible the use of smoothed optimal transportation distances for data analysis. The Sinkhorn divergence allows the fast computation of an entropically regularized Wasserstein distance between two probability distributions supported on a finite metric space of possibly high dimension. For data sampled from one or two unknown probability distributions, we derive the distributional limits of the empirical Sinkhorn divergence and its centered version (the Sinkhorn loss). We also propose a bootstrap procedure that yields new test statistics for measuring the discrepancies between multivariate probability distributions. Our work is inspired by the results of Sommerfeld and Munk in [32] on the asymptotic distribution of the empirical Wasserstein distance on finite spaces using unregularized transportation costs. Incidentally, we also analyze the asymptotic distribution of entropy-regularized Wasserstein distances when the regularization parameter tends to zero. Simulated and real datasets are used to illustrate our approach.
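For concreteness, here is a minimal sketch of the standard Sinkhorn iterations computing the entropy-regularized transport cost between two discrete distributions. This is a generic textbook implementation, not the authors' code; the function name and stopping rule are illustrative:

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps, n_iter=1000, tol=1e-9):
    """Entropy-regularized OT cost between discrete distributions a and b.

    a, b : histograms (nonnegative, summing to 1) on supports of sizes m, n
    C    : (m, n) ground cost matrix
    eps  : regularization parameter (the cost approaches W as eps -> 0)
    """
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # alternating scaling updates
        u_new = a / (K @ v)
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    P = u[:, None] * K * v[None, :]      # regularized transport plan
    return np.sum(P * C)                 # transport cost under that plan
```

The centered version mentioned above (the Sinkhorn loss) is then obtained as sinkhorn_cost(a, b) - (sinkhorn_cost(a, a) + sinkhorn_cost(b, b)) / 2, which removes the entropic bias and vanishes when a = b.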
We present a framework to simultaneously align and smooth data in the form of multiple point clouds sampled from unknown densities with support in a d-dimensional Euclidean space. This work is motivated by applications in bioinformatics where researchers aim to automatically homogenize large datasets to compare and analyze characteristics within the same cell population. Inconveniently, the acquired information is typically noisy due to mis-alignment caused by technical variations of the environment. To overcome this problem, we propose to register multiple point clouds by using the notion of regularized barycenters (or Fréchet means) of a set of probability measures with respect to the Wasserstein metric. A first approach consists in penalizing a Wasserstein barycenter with a convex functional, as recently proposed in [5]. A second strategy is to replace the Wasserstein metric itself with an entropy-regularized transportation cost between probability measures, as introduced in [12]. The main contribution of this work is to propose data-driven choices for the regularization parameters involved in each approach using the Goldenshluger-Lepski principle; a generic sketch of this principle is given after this abstract. Simulated data sampled from Gaussian mixtures are used to illustrate each method, and an application to the analysis of flow cytometry data is finally proposed. This way of choosing the regularization parameter for the Sinkhorn barycenter is also analyzed through the prism of an oracle inequality that relates the error made by such data-driven estimators to that of an ideal estimator.
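The Goldenshluger-Lepski principle selects a parameter by trading off an estimated bias proxy against a variance-type majorant. Below is a minimal generic sketch under simplifying assumptions, not the paper's exact procedure; `estimators`, `dist`, and `variance_bound` are hypothetical names for user-supplied ingredients:

```python
def goldenshluger_lepski(params, estimators, dist, variance_bound):
    """Data-driven parameter choice via the Goldenshluger-Lepski principle.

    params         : grid of candidate regularization parameters
    estimators     : dict mapping each parameter to its fitted estimator
    dist           : metric between two estimators (e.g. L2 distance of pdfs)
    variance_bound : function t -> V(t), a majorant of the stochastic error
    """
    def bias_proxy(t):
        # Compare the estimator at t against all less-regularized ones;
        # the excess over their variance bound proxies the bias at t.
        return max(
            (max(dist(estimators[t], estimators[s]) - variance_bound(s), 0.0)
             for s in params if s <= t),
            default=0.0,
        )

    # Pick the parameter balancing the bias proxy and the variance bound.
    return min(params, key=lambda t: bias_proxy(t) + variance_bound(t))
```

In the barycenter setting, each candidate parameter would index either the penalty weight of the penalized barycenter or the entropic parameter of the Sinkhorn barycenter, with one fitted barycenter per grid point.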