2022
DOI: 10.1101/2022.12.12.520180
Preprint

Leveraging a machine learning derived surrogate phenotype to improve power for genome-wide association studies of partially missing phenotypes in population biobanks

Abstract: While population biobanks have dramatically expanded opportunities for genome-wide association studies (GWAS), these large-scale analyses bring new statistical challenges. A key bottleneck is that phenotypes of interest are often partially missing. For example, phenotypes derived from specialized imaging modalities are often only measured for a subset of the cohort. Fortunately, biobanks contain surrogate phenotype information, in the form of routinely collected clinical data, that can often be leveraged to bu…

Cited by 4 publications (4 citation statements)
References 63 publications
“…Using 61,838 TOPMed lipids samples, it took 8 hours using 250 2.10-GHz computing cores with 12-GB memory for single-variant multi-trait analysis, which is scalable for large WGS/WES datasets. On the other hand, MultiSTAAR could be further extended to allow for dynamic windows with data-adaptive sizes in genetic region analysis [24,44], to properly leverage synthetic surrogates in the presence of partially missing phenotypes [45], and to incorporate summary statistics for meta-analysis of multiple WGS/WES studies [46].…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…Using 61,838 TOPMed lipids samples, it took 8 hours using 250 2.10-GHz computing cores with 12-GB memory for single-variant multi-trait analysis, which is scalable for large WGS/WES datasets. On the other hand, MultiSTAAR could be further extended to allow for dynamic windows with data-adaptive sizes in genetic region analysis [24,44], to properly leverage synthetic surrogates in the presence of partially missing phenotypes [45], and to incorporate summary statistics for meta-analysis of multiple WGS/WES studies [46].…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…As such, it is essential to thoroughly validate results with external data. In future work, this could be addressed with multiple imputation [53] or with downstream tests that allow different effect sizes or noise variances between imputed and observed phenotypes [54].…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…Additionally, POP-GWAS is both user-friendly and computationally efficient. Compared to joint models for primary and surrogate phenotypes [49], our approach only requires three sets of GWAS summary statistics as input and completes GWAS analysis for millions of SNPs within minutes. We have also made necessary extensions to account for binary phenotypes, sample relatedness, and overlapping samples between summary statistics datasets, making POP-GWAS a versatile tool suitable for broad applications.…”
Section: Discussion
Citation type: mentioning
confidence: 99%