Software for Distributed Computation on Medical Databases: A Demonstration Project

Narasimhan, Balasubramanian; Rubin, Daniel L.; Gross, Samuel M.; Bendersky, Marina; Lavori, Philip W.

doi:10.18637/jss.v077.i13

Cited by 12 publications

(9 citation statements)

References 16 publications

“…For GWAS, our pipeline leverages an algorithm for performing PCA by communicating the LD-matrices (Algorithm 1), and subsequently a method for performing generalized regression (most commonly, linear and logistic regression) by iteratively solving a regularized regression at each silo. This general, iterative approach, known as Alternating Directions Method of Multipliers (11; 12), is guaranteed to converge (13), has successfully been applied to many problems (14; 15; 16), and for GWAS we show that moderately high accuracies can be achieved with a few iterations (also see (14)). We show that our pipeline is accurate, scalable, practical, and a significant improvement over the meta 1.0 approach.…”

Section: Introduction (mentioning)

confidence: 90%

Caring without sharing: Meta-analysis 2.0 for massive genome-wide association studies

Pourshafeie

Bustamante

Prabhu

2018

Preprint

Genome-wide association studies have been effective at revealing the genetic architecture of simple traits. Extending this approach to more complex phenotypes has necessitated a massive increase in cohort size. To achieve sufficient power, participants are recruited across multiple collaborating institutions, leaving researchers with two choices: either collect all the raw data at a single institution or rely on meta-analyses to test for association. In this work, we present a third alternative. Here, we implement an entire GWAS workflow (quality control, population structure control, and association) in a fully decentralized setting. Our iterative approach (a) does not rely on consolidating the raw data at a single coordination center, and (b) does not hinge upon large sample size assumptions at each silo. As we show, our approach overcomes challenges faced by meta-studies when it comes to associating rare alleles and when case/control proportions are wildly imbalanced at each silo. We demonstrate the feasibility of our method in cohorts ranging in size from 2K (small) to 500K (large), and recruited across 2 to 10 collaborating institutions.

Introduction

Genome-wide association studies (GWAS) are a popular approach to elucidate genetic architecture of human phenotypes. This design has led to the discovery of many novel loci underpinning a panoply of human traits (see Visscher et al. for a recent review (1)). For traits driven by few variants with large effects, moderately sized cohorts have been sufficient to power discovery. However, the GWAS framework demands increasingly larger cohort sizes as the complexity of the trait grows. To achieve required statistical power today, large, multi-institutional consortia are assembled under a common data sharing agreement.

3; 4) or a combination of meta- and mega-analysis (centralized analysis) (5; 6; 7; 8) constitute two major approaches to conducting GWAS. Each approach offers merits and shortcomings. For mega-analysis, collecting all the data at every analysis core is not only expensive and time consuming but also creates a security vulnerability at each institution that hosts a copy of the data. Conversely, the meta-analysis approach eliminates the need for data replication, but is more limited in flexibility. In particular, (a) subtle differences in models, assumptions, and quality control (QC) can introduce biases in the results (9), (b) the shared data and summary statistics might be inadequate for some types of inference (e.g. individual level population structure control, conditional or joint analysis, etc.), and (c) parameter estimates can be unreliable for rare variants or from centers contributing small sample sizes because the asymptotic properties of maximum likelihood estimation theory may not hold (10).

In this manuscript we develop a method that interpolates between centralized and meta-analysis methods. Like meta-studies, our paradigm...
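The citation statement above describes fitting a regression by "iteratively solving a regularized regression at each silo" (ADMM). The sketch below illustrates that idea for plain linear least squares only; it is not the citing authors' pipeline. The function names (`silo_update`, `consensus_admm`), the penalty `rho`, the iteration count, and the simulated data are all assumptions made for illustration. Each silo keeps its own rows and shares only its current coefficient vector with the coordinator.

```python
import numpy as np

def silo_update(A, b, z, u, rho):
    """Local step (hypothetical helper): a ridge-like regression on the
    silo's own data, pulled toward the current consensus estimate z."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + rho * np.eye(d), A.T @ b + rho * (z - u))

def consensus_admm(silos, rho, n_iter=300):
    """Consensus ADMM for least squares; only coefficient vectors leave a silo."""
    d = silos[0][0].shape[1]
    z = np.zeros(d)                       # consensus coefficients (coordinator)
    us = [np.zeros(d) for _ in silos]     # scaled dual variable, one per silo
    for _ in range(n_iter):
        xs = [silo_update(A, b, z, u, rho) for (A, b), u in zip(silos, us)]
        z = np.mean([x + u for x, u in zip(xs, us)], axis=0)
        us = [u + x - z for x, u in zip(xs, us)]
    return z

# Simulated data (assumption): three silos of different sizes, same true model.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
silos = []
for n in (80, 120, 100):
    A = rng.normal(size=(n, 3))
    silos.append((A, A @ beta_true + 0.1 * rng.normal(size=n)))

z = consensus_admm(silos, rho=100.0)

# Reference fit on the pooled rows, which a real deployment would never form.
pooled = np.linalg.lstsq(
    np.vstack([A for A, _ in silos]),
    np.concatenate([b for _, b in silos]),
    rcond=None,
)[0]
```

In this toy setting the consensus vector `z` should agree with the pooled fit up to numerical tolerance. The choice `rho=100.0` is tuned to the simulated sample sizes and would need adjusting for other data.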

“…Both obstacles can be overcome by turning to distributed computations, which consists in leaving the data on sites and distributing the calculations, so that hospitals only share some intermediate results instead of the raw data (Narasimhan et al, 2017). Among other methods, SVD, which only involves inner products and sums, can be very straightforwardly implemented in a distributed manner.…”

Section: Imputation Of Multilevel Mixed Data (mentioning)

confidence: 99%

“…Indeed, all the computations in Algorithm 4 can be done in parallel with a master-slave architecture (Narasimhan et al, 2017), where a central server collects summary statistics computed locally on sites, as illustrated Figure 5. Here, the local right singular vectors v_j, j ∈ {1, .…”

Section: Distributed Rank-q PCA (mentioning)

confidence: 99%

Imputation of Mixed Data With Multilevel Singular Value Decomposition

Husson

Josse

Narasimhan

et al. 2019

Journal of Computational and Graphical Statistics

Self Cite

Statistical analysis of large data sets offers new opportunities to better understand many processes. Yet, data accumulation often implies relaxing acquisition procedures or compounding diverse sources. As a consequence, such data sets often contain mixed data, i.e. both quantitative and qualitative and many missing values. Furthermore, aggregated data present a natural multilevel structure, where individuals or samples are nested within different sites, such as countries or hospitals. Imputation of multilevel data has therefore drawn some attention recently, but current solutions are not designed to handle mixed data, and suffer from important drawbacks such as their computational cost. In this article, we propose a single imputation method for multilevel data, which can be used to complete either quantitative, categorical or mixed data. The method is based on multilevel singular value decomposition (SVD), which consists in decomposing the variability of the data into two components, the between and within groups variability, and performing SVD on both parts. We show on a simulation study that in comparison to competitors, the method has the great advantages of handling data sets of various size, and being computationally faster. Furthermore, it is the first so far to handle mixed data. We apply the method to impute a medical data set resulting from the aggregation of several data sets coming from different hospitals. This application falls in the framework of a larger project on Trauma patients. To overcome obstacles associated to the aggregation of medical data, we turn to distributed computation. The method is implemented in an R package.
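The citation statements from this article note that SVD "only involves inner products and sums" and that a central server can collect summary statistics computed locally on sites. One minimal way to sketch this, assuming each site shares only its Gram matrix, is shown below. It is not the algorithm of the cited article or its R package; the function names (`local_gram`, `aggregate_svd`) and the simulated data are hypothetical.

```python
import numpy as np

def local_gram(X_site):
    """Computed at each site: a p x p matrix of inner products between columns.
    No individual rows leave the site."""
    return X_site.T @ X_site

def aggregate_svd(grams, rank):
    """Computed at the central server: sum the site Gram matrices, then take
    the leading eigenpairs, which give the pooled singular values and the
    right singular vectors (up to sign)."""
    G = np.sum(grams, axis=0)
    eigvals, eigvecs = np.linalg.eigh(G)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:rank]      # keep the largest `rank`
    sing = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return sing, eigvecs[:, order]

# Simulated sites (assumption): three hospitals with 4 shared variables.
rng = np.random.default_rng(0)
sites = [rng.normal(size=(n, 4)) for n in (30, 50, 20)]

sing, V = aggregate_svd([local_gram(X) for X in sites], rank=2)

# Reference: SVD of the pooled rows, which a real deployment would not form.
pooled_sing = np.linalg.svd(np.vstack(sites), compute_uv=False)[:2]
```

The singular values recovered from the summed Gram matrices should match those of the pooled data. Sharing a full Gram matrix may itself raise disclosure questions when sites are small, which this sketch does not address.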

“…Brown et al 2010a; Brown et al 2010b) and open source software (e.g. Carter et al 2016; Narasimhan et al 2017). The Canadian Network for Observational Drug Effect Studies (CNODES, Suissa et al 2012) and Mini-Sentinel (a safety surveillance system developed by the U.S. Food and Drugs Administration, Platt and Carnahan, 2012) are both platforms to facilitate the running of analysis requests from approved users locally, along with disclosure checks, prior to securely combining the results centrally as a meta-analysis.…”

Section: Alternative Approaches (mentioning)

confidence: 99%

DataSHIELD – New Directions and Dimensions

Wilson

Butters

Avraam

et al. 2017

Data Science Journal

In disciplines such as biomedicine and social sciences, sharing and combining sensitive individual-level data is often prohibited by ethical-legal or governance constraints and other barriers such as the control of intellectual property or the huge sample sizes. DataSHIELD (Data Aggregation Through Anonymous Summary-statistics from Harmonised Individual-levEL Databases) is a distributed approach that allows the analysis of sensitive individual-level data from one study, and the co-analysis of such data from several studies simultaneously without physically pooling them or disclosing any data. Following initial proof of principle, a stable DataSHIELD platform has now been implemented in a number of epidemiological consortia. This paper reports three new applications of DataSHIELD including application to post-publication sensitive data analysis, text data analysis and privacy protected data visualisation. Expansion of DataSHIELD analytic functionality and application to additional data types demonstrate the broad applications of the software beyond biomedical sciences.
