In this paper we propose and study a class of simple, nonparametric, yet interpretable measures of association between two random variables X and Y taking values in general topological spaces. These nonparametric measures, defined using the theory of reproducing kernel Hilbert spaces, capture the strength of dependence between X and Y and have the property that they are 0 if and only if the variables are independent and 1 if and only if one variable is a measurable function of the other. Further, these population measures can be consistently estimated using the general framework of geometric graphs, which includes k-nearest neighbor graphs and minimum spanning trees. Moreover, a sub-class of these estimators is also shown to adapt to the intrinsic dimensionality of the underlying distribution. Some of these empirical measures can also be computed in near linear time. Under the hypothesis of independence between X and Y, these empirical measures (properly normalized) have a standard normal limiting distribution. Thus, these measures can also be readily used to test the hypothesis of mutual independence between X and Y. In fact, as far as we are aware, these are the only procedures that possess all the above-mentioned desirable properties. Furthermore, when restricting to Euclidean spaces, we can make these sample measures of association finite-sample distribution-free, under the hypothesis of independence, by using multivariate ranks defined via the theory of optimal transport. The correlation coefficients proposed in Dette et al. [31], Chatterjee [22] and Azadkia and Chatterjee [7] can be seen as special cases of this general class of measures.
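To make the "0 under independence, 1 under functional dependence" behaviour concrete, here is a minimal sketch of the rank correlation of Chatterjee [22], which the abstract names as a special case. It uses the standard no-ties formula, xi_n = 1 - 3 * sum |r_{i+1} - r_i| / (n^2 - 1), where r_i is the rank of the Y value paired with the i-th smallest X. The function name `chatterjee_xi` is illustrative, and this is not the kernel or graph-based estimator proposed in the paper.

```python
import random


def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n, assuming no ties in x or y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    # Sort the pairs by x, then read off the y values in that order.
    order = sorted(range(n), key=lambda i: x[i])
    y_by_x = [y[i] for i in order]
    # r[i] = rank (1..n) of the i-th y value in the x-sorted sequence.
    r = [0] * n
    by_y = sorted(range(n), key=lambda i: y_by_x[i])
    for rank, idx in enumerate(by_y, start=1):
        r[idx] = rank
    # Sum of absolute differences between consecutive ranks.
    total = sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(-1, 1) for _ in range(2000)]
    noise = [rng.gauss(0, 1) for _ in range(2000)]
    # Y = X^2 is a non-monotone function of X: the value should be close to 1.
    print(chatterjee_xi(xs, [v * v for v in xs]))
    # Independent noise: the value should be close to 0.
    print(chatterjee_xi(xs, noise))
```

Note that the coefficient is not symmetric in its arguments: it measures how well Y is explained as a function of X, not the reverse.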
In this paper, we propose a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of measure transportation. Unlike other existing proposals in the literature, these multivariate ranks share a number of useful properties with the usual one-dimensional ranks; most importantly, these ranks are distribution-free. This crucial observation allows us to design nonparametric tests that are exactly distribution-free under the null hypothesis. We demonstrate the applicability of this approach by constructing exact distribution-free tests for two classical nonparametric problems: (I) testing for mutual independence between random vectors, and (II) testing for the equality of multivariate distributions. In particular, we propose (multivariate) rank versions of distance covariance (Székely et al. [142]) and energy statistic (Székely and Rizzo [141]) for testing scenarios (I) and (II) respectively. In both these problems we derive the asymptotic null distribution of the proposed test statistics. We further show that our tests are consistent against all fixed alternatives. Moreover, the proposed tests are tuning-free, computationally feasible and are well-defined under minimal assumptions on the underlying distributions (e.g., they do not need any moment assumptions). We also demonstrate the efficacy of these procedures via extensive simulations. In the process of analyzing the theoretical properties of our procedures, we end up proving some new results in the theory of measure transportation and in the limit theory of permutation statistics using Stein's method for exchangeable pairs, which may be of independent interest.
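As a rough sketch of how ranks can be defined through measure transportation, suppose we have observations $X_1,\dots,X_n \in \mathbb{R}^d$ and a fixed set of reference points $c_1,\dots,c_n$ (for example, a quasi-random sequence in $[0,1]^d$). The symbols $c_i$, $\hat{\sigma}$ and $\hat{R}_n$ are notation assumed here for illustration and may differ from the paper's. The empirical ranks can then be written as the solution of an assignment problem:

```latex
\hat{\sigma} \in \operatorname*{arg\,min}_{\sigma \in S_n}
  \sum_{i=1}^{n} \lVert X_i - c_{\sigma(i)} \rVert^{2},
\qquad
\hat{R}_n(X_i) := c_{\hat{\sigma}(i)}, \quad i = 1,\dots,n,
```

where $S_n$ is the set of permutations of $\{1,\dots,n\}$. If the $X_i$ are i.i.d. from an absolutely continuous distribution, the minimizer is almost surely unique and, by exchangeability, the vector of ranks is uniformly distributed over the $n!$ orderings of the reference points, whatever the underlying distribution. For $d = 1$ and $c_i = i/n$ this reduces to the usual one-dimensional ranks scaled by $1/n$.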
In this paper we propose and study a class of simple, nonparametric, yet interpretable measures of conditional dependence between two random variables Y and Z given a third variable X, all taking values in general topological spaces. The population version of any of these nonparametric measures, defined using the theory of reproducing kernel Hilbert spaces (RKHSs), captures the strength of conditional dependence: it is 0 if and only if Y and Z are conditionally independent given X, and 1 if and only if Y is a measurable function of Z and X. Thus, our measure, which we call the kernel partial correlation (KPC) coefficient, can be thought of as a nonparametric generalization of the classical partial correlation coefficient that possesses the above properties when (X, Y, Z) is jointly normal. We describe two consistent methods of estimating KPC. Our first method of estimation is graph-based and utilizes the general framework of geometric graphs, including K-nearest neighbor graphs and minimum spanning trees. A sub-class of these estimators can be computed in near linear time and converges at a rate that automatically adapts to the intrinsic dimensionality of the underlying distribution(s). Our second strategy involves direct estimation of conditional mean embeddings using cross-covariance operators in the RKHS framework. Using these empirical measures we develop forward stepwise (high-dimensional) nonlinear variable selection algorithms. We show that our algorithm, using the graph-based estimator, yields a provably consistent model-free variable selection procedure, even in the high-dimensional regime when the number of covariates grows exponentially with the sample size, under suitable sparsity assumptions. Extensive simulation and real-data examples illustrate the superior performance of our methods compared to existing procedures.
The recent conditional dependence measure proposed by Azadkia and Chatterjee [5] can also be viewed as a special case of our general framework.
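For orientation, the classical partial correlation coefficient that KPC is said to generalize can be sketched as follows, assuming a scalar conditioning variable X. It uses the standard formula built from the three pairwise Pearson correlations. The function names are illustrative, and this is the classical linear quantity, not the KPC estimator itself.

```python
import math


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_correlation(y, z, x):
    """Classical partial correlation of y and z given a scalar x."""
    r_yz, r_yx, r_zx = pearson(y, z), pearson(y, x), pearson(z, x)
    # (r_yz - r_yx * r_zx) / sqrt((1 - r_yx^2) * (1 - r_zx^2))
    return (r_yz - r_yx * r_zx) / math.sqrt((1 - r_yx ** 2) * (1 - r_zx ** 2))
```

If Y and Z are each equal to X plus independent Gaussian noise, they are marginally correlated but conditionally independent given X, so `pearson(y, z)` is clearly positive while `partial_correlation(y, z, x)` is close to 0.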