We follow the line of work that uses classifiers for two-sample testing and propose several tests based on the Random Forest classifier. The developed tests are easy to use, require no tuning, and are applicable to any distribution on $\mathbb{R}^p$, even in high dimensions. We provide a comprehensive treatment of the use of classification for two-sample testing, derive the distribution of our tests under the null hypothesis, and provide a power analysis, both in theory and with simulations. To simplify the use of the method, we also provide the R package "hypoRF".
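The idea of turning a classifier into a two-sample test can be sketched as follows. This is a minimal illustration in Python, not the hypoRF package or its API: a nearest-centroid rule stands in for the Random Forest, the function names are hypothetical, and the number of correct held-out predictions is compared with a Binomial(n_test, 1/2) reference distribution, which is the usual approximation when both samples have the same size.

```python
import math
import random


def binomial_upper_tail(k, n, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(pt[j] for pt in points) / len(points) for j in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classifier_two_sample_test(xs, ys, seed=0):
    """Return (held-out accuracy, p-value) for H0: xs and ys share one distribution.

    Assumes len(xs) == len(ys), so chance-level accuracy under H0 is 1/2.
    """
    rng = random.Random(seed)
    xs, ys = list(xs), list(ys)
    rng.shuffle(xs)
    rng.shuffle(ys)
    half = len(xs) // 2
    # Fit the stand-in classifier (nearest centroid) on the first halves only.
    cx, cy = centroid(xs[:half]), centroid(ys[:half])
    # Evaluate on the second halves, which the classifier has not seen.
    test = [(pt, 0) for pt in xs[half:]] + [(pt, 1) for pt in ys[half:]]
    correct = sum(
        (0 if sq_dist(pt, cx) <= sq_dist(pt, cy) else 1) == label
        for pt, label in test
    )
    # A large number of correct predictions is evidence against H0.
    p_value = binomial_upper_tail(correct, len(test))
    return correct / len(test), p_value
```

Calling `classifier_two_sample_test(xs, ys)` on two equally sized lists of points returns the held-out accuracy and a one-sided p-value; any classifier could replace the centroid rule, as long as it is fitted on data that is not reused for evaluation.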
Background: Numerous studies using univariate analyses have shown that specific components of breast milk, considered separately, are associated with disease status in the mother or the child. However, very few studies have used multivariate analysis approaches to evaluate the relationship between multiple breast milk components simultaneously. Aim: Here we aimed to visualize the complex interactions among breast milk components in the context of the allergy status of the mother or the child. Methods: Milk samples were collected from lactating mothers participating in the Leipziger Forschungszentrum für Zivilisationskrankheiten (LIFE) Child cohort in Leipzig, Germany. A total of 156 breast milk samples, collected at 3 months after birth from mother/infant pairs, were analyzed for 51 breast milk components. Correlation, principal component analysis (PCA) and graphical discovery analysis were used. Results: Correlations ranging from 0.40 to 0.96 were observed between breast milk fatty acid and breast milk phospholipid levels, and correlations ranging from 0 to 0.76 were observed between specific human milk oligosaccharides (HMO). PCA did not separate the data by the risk of allergy in the infants. When graphical discovery analysis was used, dependencies between maternal plasma immunoglobulin E (IgE) level and the
We propose an adaptation of the Random Forest algorithm to estimate the conditional distribution of a possibly multivariate response. We suggest a new splitting criterion based on the MMD two-sample test, which is suitable for detecting heterogeneity in multivariate distributions. The weights provided by the forest can be conveniently used as an input to other methods in order to locally solve various learning problems. The code is available as the R package drf.
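To make the MMD quantity concrete, the sketch below computes a plain quadratic-time estimate of the squared maximum mean discrepancy between two samples of (possibly multivariate) responses. It is an illustration only and not the drf package's implementation: the Gaussian kernel, the `bandwidth` parameter and the function names are assumptions made for this example.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth**2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples xs and ys."""

    def mean_kernel(us, vs):
        # Average kernel value over all pairs (u, v), including u == v.
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

In a tree-growing context, one could evaluate `mmd_squared(left_responses, right_responses)` for each candidate split and prefer the split whose child nodes differ most in distribution; how the criterion is weighted and approximated in practice is left open here.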
The statistics and machine learning communities have recently seen a growing interest in classification-based approaches to two-sample testing. The outcome of a classification-based two-sample test remains a rejection decision, which is not always informative since the null hypothesis is seldom strictly true. Therefore, when a test rejects, it would be beneficial to provide an additional quantity serving as a refined measure of distributional difference. In this work, we introduce a framework for the construction of high-probability lower bounds on the total variation distance. These bounds are based on a one-dimensional projection, such as a classification or regression method, and can be interpreted as the minimal fraction of samples pointing towards a distributional difference. We further derive asymptotic power and detection rates of two proposed estimators and discuss potential uses through an application to a reanalysis climate dataset.
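A simple way to see how a one-dimensional projection yields a high-probability lower bound on the total variation distance is the following sketch. It is a deliberately basic Hoeffding-type construction and not one of the estimators proposed in the work: for any fixed set A, TV(P, Q) >= P(A) - Q(A), and here A = {score > threshold} for a classifier score computed on held-out data. The function name, the `threshold` and the `alpha` parameter are assumptions for illustration, and the threshold must be chosen independently of the samples used for evaluation.

```python
import math


def tv_lower_bound(scores_p, scores_q, threshold=0.5, alpha=0.05):
    """Lower bound on TV(P, Q) that holds with probability at least 1 - alpha.

    scores_p, scores_q: one-dimensional projections (e.g. classifier scores)
    of held-out samples from P and Q. Assumes scores tend to be higher under P.
    """
    m, n = len(scores_p), len(scores_q)
    # Empirical mass of the set A = {score > threshold} under each sample.
    p_hat = sum(s > threshold for s in scores_p) / m
    q_hat = sum(s > threshold for s in scores_q) / n
    # One-sided Hoeffding margins at level alpha / 2 each (union bound).
    eps_p = math.sqrt(math.log(2.0 / alpha) / (2.0 * m))
    eps_q = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    # TV(P, Q) >= P(A) - Q(A) >= p_hat - q_hat - eps_p - eps_q on the good event.
    return max(0.0, p_hat - q_hat - eps_p - eps_q)
```

A returned value of, say, 0.3 would be read as: with probability at least 1 - alpha, at least 30% of the mass differs between the two distributions. A value of 0 is uninformative rather than evidence that the distributions coincide.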