We develop and analyze M-estimation methods for divergence functionals and the likelihood ratios of two probability distributions. Our method is based on a non-asymptotic variational characterization of f-divergences, which allows the problem of estimating divergences to be tackled via convex empirical risk optimization. The resulting estimators are simple to implement, requiring only the solution of standard convex programs. We present an analysis of consistency and convergence for these estimators. Given conditions only on the ratios of densities, we show that our estimators can achieve optimal minimax rates for the likelihood ratio and the divergence functionals in certain regimes. We derive an efficient optimization algorithm for computing our estimates, and illustrate their convergence behavior and practical viability through simulations.
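For reference, the variational characterization underlying this approach can be stated in its standard form (the notation here is generic, not quoted from the paper): for a convex function f with f(1) = 0 and convex conjugate f^*, the f-divergence between distributions P and Q satisfies

    D_f(P, Q) = \sup_{g} \left\{ \mathbb{E}_P[g(X)] - \mathbb{E}_Q[f^*(g(X))] \right\},

with the supremum taken over measurable functions g. Replacing the expectations by empirical averages over samples X_1, \dots, X_n \sim P and Y_1, \dots, Y_n \sim Q, and restricting g to a convex function class \mathcal{G}, gives the M-estimator

    \widehat{D}_f = \sup_{g \in \mathcal{G}} \left\{ \frac{1}{n} \sum_{i=1}^{n} g(X_i) - \frac{1}{n} \sum_{j=1}^{n} f^*(g(Y_j)) \right\},

a concave maximization over \mathcal{G}, hence a convex program; the maximizing g also serves as an estimate of a transformed likelihood ratio dP/dQ.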
This paper studies convergence behavior of latent mixing measures that arise in finite and infinite mixture models, using transportation distances (i.e., Wasserstein metrics). The relationship between Wasserstein distances on the space of mixing measures and f-divergence functionals such as Hellinger and Kullback-Leibler distances on the space of mixture distributions is investigated in detail using various identifiability conditions. Convergence in Wasserstein metrics for discrete measures implies convergence of individual atoms that provide support for the measures, thereby providing a natural interpretation of convergence of clusters in clustering applications where mixture models are typically employed. Convergence rates of posterior distributions for latent mixing measures are established, for both finite mixtures of multivariate distributions and infinite mixtures based on the Dirichlet process.
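For concreteness, the transportation distance in question takes a standard form on discrete measures (stated here generically, not quoted from the paper): if G = \sum_i p_i \delta_{\theta_i} and G' = \sum_j p'_j \delta_{\theta'_j}, then

    W_r(G, G') = \left( \inf_{q \in \mathcal{Q}(p, p')} \sum_{i,j} q_{ij} \, \rho(\theta_i, \theta'_j)^r \right)^{1/r},

where \mathcal{Q}(p, p') is the set of couplings q with marginals p and p', and \rho is the metric on the atoms. Because the cost is measured directly between atoms, W_r(G_n, G) \to 0 forces the supporting atoms of G_n to converge to those of G, which underlies the clustering interpretation described above.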
We show that the coarse-grained and fine-grained localization problems for ad hoc sensor networks can each be posed and solved as a pattern recognition problem using kernel methods from statistical learning theory. This stems from the observation that the kernel function, which is a similarity measure critical to the effectiveness of a kernel-based learning algorithm, can be naturally defined in terms of the matrix of signal strengths received by the sensors. Thus we work in the natural coordinate system provided by the physical devices. This not only allows us to sidestep the difficult ranging procedure required by many existing localization algorithms in the literature, but also enables us to derive a simple and effective localization algorithm. The algorithm is particularly suitable for networks with densely distributed sensors, most of whose locations are unknown. The computations are initially performed at the base sensors, and the computation cost depends only on the number of base sensors. The localization step for each sensor of unknown location is then performed locally in linear time. We present an analysis of the localization error bounds, and provide an evaluation of our algorithm on both simulated and real sensor networks.
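As a concrete illustration of this style of localizer (a minimal sketch, assuming a Gaussian kernel over received-signal-strength vectors and a kernel ridge regression formulation; the function names and parameters are illustrative, not the paper's exact algorithm):

import numpy as np

def gaussian_kernel(S1, S2, sigma=1.0):
    # Each row is a sensor's vector of signal strengths received from the
    # base sensors; similarity is a Gaussian in that coordinate system.
    sq = ((S1[:, None, :] - S2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def fit_localizer(S_base, X_base, lam=1e-3, sigma=1.0):
    # Kernel ridge regression from signal-strength space to coordinates;
    # the training cost is an n x n solve, n = number of base sensors.
    K = gaussian_kernel(S_base, S_base, sigma)
    return np.linalg.solve(K + lam * np.eye(len(S_base)), X_base)

def localize(S_query, S_base, alpha, sigma=1.0):
    # Each unknown-location sensor is placed by a kernel-weighted
    # combination of the base locations: linear time per sensor.
    return gaussian_kernel(S_query, S_base, sigma) @ alpha

Training involves only the base sensors, and localizing each additional sensor is a single kernel evaluation against the base rows, matching the cost profile described above.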
The goal of binary classification is to estimate a discriminant function γ from observations of covariate vectors and corresponding binary labels. We consider an elaboration of this problem in which the covariates are not available directly but are transformed by a dimensionality-reducing quantizer Q. We present conditions on loss functions such that empirical risk minimization yields Bayes consistency when both the discriminant function and the quantizer are estimated. These conditions are stated in terms of a general correspondence between loss functions and a class of functionals known as Ali-Silvey or f-divergence functionals. Whereas this correspondence was established by Blackwell [Proc. 2nd Berkeley Symp. Probab. Statist. 1 (1951) 93-102. Univ. California Press, Berkeley] for the 0-1 loss, we extend the correspondence to the broader class of surrogate loss functions that play a key role in the general theory of Bayes consistency for binary classification. Our result makes it possible to pick out the (strict) subset of surrogate loss functions that yield Bayes consistency for joint estimation of the discriminant function and the quantizer. In the classical formulation, classification performance is generally assessed in terms of the 0-1 loss as follows. Letting P denote the distribution of (X, Y), and letting γ : X → R denote a given discriminant function, we seek to minimize the expectation of the 0-1 loss, that is, the error probability P(Y ≠ sign(γ(X))). Unfortunately, the 0-1 loss is a nonconvex function, and practical classification algorithms, such as boosting and the support vector machine, are based on relaxing the 0-1 loss to a convex upper bound or approximation, yielding a surrogate loss function to which empirical risk minimization procedures can be applied. A significant achievement of the recent literature on binary classification has been the delineation of necessary and sufficient conditions under which such relaxations yield Bayes consistency [2,9,12,13,19,22]. In many practical applications, this classical formulation of binary classification is elaborated to include an additional stage of "feature selection" or "dimension reduction," in which the covariate vector X is transformed into a vector Z according to a data-dependent mapping Q. An interesting example of this more elaborate formulation is a "distributed detection" problem, in which individual components of the d-dimensional covariate vector are measured at spatially separated locations, and there are communication constraints that limit the rate at which the measurements can be forwarded to a central location where the classification decision is made [21]. This communication-constrained setting imposes severe constraints on the choice of Q: any mapping Q must be a separable function.
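For intuition, the prototype of this correspondence (a standard identity, stated here under the simplifying assumption of equal class priors, not quoted from the paper) links the optimal 0-1 risk after quantization to a total variation distance, which is an f-divergence:

    \min_{\gamma} \, \mathbb{P}\bigl(Y \neq \mathrm{sign}(\gamma(Z))\bigr) = \frac{1}{2} \Bigl( 1 - d_{TV}\bigl(P(Z \mid Y = 1), \, P(Z \mid Y = -1)\bigr) \Bigr),
    \qquad d_{TV}(\mu, \nu) = \tfrac{1}{2} \sum_{z} |\mu(z) - \nu(z)|.

Minimizing the optimal risk over quantizers Q is therefore equivalent to maximizing this divergence between the quantized class-conditional distributions; the correspondence in the paper extends the same logic from the 0-1 loss to surrogate losses, each paired with its own f-divergence.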
We consider the problem of decentralized detection under constraints on the number of bits that can be transmitted by each sensor. In contrast to most previous work, in which the joint distribution of sensor observations is assumed to be known, we address the problem when only a set of empirical samples is available. We propose a novel algorithm using the framework of empirical risk minimization and marginalized kernels, and analyze its computational and statistical properties both theoretically and empirically. We provide an efficient implementation of the algorithm, and demonstrate its performance on both simulated and real data sets.
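For reference, the marginalized-kernel construction the abstract names takes the following standard form (the per-sensor factorization shown here is an assumption of this sketch): each sensor t maps its observation x_t to a quantized message z_t through a stochastic rule Q_t(z_t | x_t), and a base kernel K_0 on message vectors z = (z_1, \dots, z_T) is averaged over the quantizers,

    K_Q(x, x') = \sum_{z, z'} \Bigl( \prod_t Q_t(z_t \mid x_t) \Bigr) \Bigl( \prod_t Q_t(z'_t \mid x'_t) \Bigr) K_0(z, z'),

yielding a kernel on raw observations that respects the per-sensor bit constraints; empirical risk minimization can then be carried out over decision rules in the reproducing kernel Hilbert space of K_Q together with the quantizer parameters.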