Privacy protection is paramount in conducting health research. However, studies often rely on data stored in a centralized repository, where analysis is done with full access to the sensitive underlying content. Recent advances in federated learning enable building complex machine-learned models that are trained in a distributed fashion. These techniques facilitate the calculation of research study endpoints such that private data never leaves a given device or healthcare system. We show—on a diverse set of single and multi-site health studies—that federated models can achieve similar accuracy, precision, and generalizability, and lead to the same interpretation as standard centralized statistical models while achieving considerably stronger privacy protections and without significantly raising computational costs. This work is the first to apply modern and general federated learning methods that explicitly incorporate differential privacy to clinical and epidemiological research—across a spectrum of units of federation, model architectures, complexity of learning tasks and diseases. As a result, it enables health research participants to remain in control of their data and still contribute to advancing science—aspects that used to be at odds with each other.
Motivation: Phenomics is an emerging branch of modern biology that uses high-throughput phenotyping tools to capture multiple environmental and phenotypic trait measurements at massive scale. The resulting high-dimensional data sets represent a treasure trove of information for providing an in-depth understanding of how multiple factors interact and contribute to controlling the growth and behavior of different plant crop genotypes. However, computational tools that can parse such high-dimensional data sets and aid in extracting plausible hypotheses are currently lacking. In this paper, we present a new algorithmic approach to effectively decode and characterize the role of the environment in phenotypic traits from complex phenomic data. To the best of our knowledge, this effort represents the first application of topological data analysis to phenomics data. Results: We applied this novel algorithmic approach to a real-world maize data set. Our results demonstrate the ability of our approach to delineate emergent behavior among subpopulations, as dictated by one or more environmental factors; notably, our approach shows how the environment plays a key role in determining the phenotypic behavior of one of the two genotypes. Availability: Source code and test data are freely available for download, with instructions, at https://xperthut.github.io/HYPPO-X.