A new nonparametric test is proposed for the multivariate two-sample problem. Similar to Rosenbaum's cross-match test, each observation is considered to be a vertex of a complete undirected weighted graph; interpoint distances are edge weights. A minimum-weight, r-regular subgraph is constructed, and the mean cross-count test statistic is equal to the number of edges in the subgraph containing one observation from the first group and one from the second, divided by r. Unequal distributions will tend to result in fewer edges that connect vertices between different groups. The mean cross-count test is sensitive to a wide range of distribution differences and has impressive power characteristics. We derive the first and second moments of the mean cross-count test, and note that simulation studies suggest this test statistic is asymptotically normal regardless of underlying data distributions. A small simulation study compares the power of the mean cross-count test to Hotelling's T² test and to the cross-match test. This new test is a more powerful generalization of Rosenbaum's test (the cross-match test is the case r = 1) and constitutes a noteworthy addition to the class of multivariate, nonparametric two-sample tests.

Keywords: Distribution-free test; Graph-theoretic procedure; Change point

1 Background
Objective
Consider N = m + n independent multivariate observations Y_1, …, Y_m and Y_{m+1}, …, Y_N, where each Y_i is drawn from distribution F for 1 ≤ i ≤ m and from distribution G for m + 1 ≤ i ≤ N. The dimension of the observations does not depend on N. The covariates may be quantitative or categorical; there need only exist some function, d, that measures distance between observations. The null hypothesis is that F = G. The objective is a two-sample test that has little or no dependence on the underlying distribution of the data. Furthermore, this test should have sufficient power to be useful for applications.
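To make the setup concrete, the following minimal sketch computes the statistic for the special case r = 1, where the minimum-weight 1-regular subgraph is a minimum-weight perfect matching (the cross-match case). It is an illustration under stated assumptions, not the procedure used in this paper: the distance d is taken to be Euclidean, the matching is found by exhaustive search (feasible only for a handful of observations), and the function names `euclidean`, `min_weight_perfect_matching`, and `mean_cross_count` are hypothetical.

```python
import math


def euclidean(a, b):
    """Assumed distance function d; any dissimilarity could be substituted."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_weight_perfect_matching(points, d=euclidean):
    """Exhaustive search for a minimum-weight perfect matching (the r = 1 case).

    Illustrative only: the search is exponential in N, so it is usable
    only for very small, even N.
    """
    n = len(points)
    if n % 2:
        raise ValueError("a perfect matching needs an even number of observations")

    best_weight = math.inf
    best_edges = []

    def extend(unmatched, edges, weight):
        nonlocal best_weight, best_edges
        if weight >= best_weight:  # prune: cannot improve on the best so far
            return
        if not unmatched:
            best_weight, best_edges = weight, list(edges)
            return
        i = unmatched[0]  # always pair the lowest-index unmatched vertex
        for j in unmatched[1:]:
            rest = [k for k in unmatched if k != i and k != j]
            edges.append((i, j))
            extend(rest, edges, weight + d(points[i], points[j]))
            edges.pop()

    extend(list(range(n)), [], 0.0)
    return best_edges


def mean_cross_count(edges, m, r=1):
    """Number of edges joining the first m observations to the rest, divided by r."""
    cross = sum(1 for i, j in edges if (i < m) != (j < m))
    return cross / r


# Pooled samples: the first m = 2 points form group one, the last n = 2 group two.
separated = [(0, 0), (0, 1), (10, 10), (10, 11)]
interleaved = [(0, 0), (10, 10), (0, 1), (10, 11)]

print(mean_cross_count(min_weight_perfect_matching(separated), m=2))    # 0.0
print(mean_cross_count(min_weight_perfect_matching(interleaved), m=2))  # 2.0
```

In the well-separated sample every matched pair stays within its own group, so the cross-count is 0; in the interleaved sample each point's nearest neighbor belongs to the other group, so both matched pairs are cross edges.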
Motivation
We follow in the vein of graph-theoretic tests for homogeneity: Consider each observation to be a vertex of a complete, undirected, weighted graph, G, and assign interpoint distances as edge weights. The distribution of these distances is sensitive to departures from homogeneity; Maa et al. (1996) prove that two distributions are identical if and only if the distributions of interpoint distances within and between observations sampled from the two populations are the same. Rafsky (1979, 1981) fit a minimum spanning tree