The UK Biobank (UKB) is a large cohort study of considerable empirical importance to fields such as medicine, epidemiology, statistical genetics, and the social sciences, due to its very large size (∼ 500,000 individuals) and its wide availability of variables. However, the UKB is not representative of its underlying sampling population. Selection bias due to volunteering (volunteer bias) is a known source of confounding. Individuals entering the UKB are more likely to be older, to be female, and of higher socioeconomic status. Using representative microdata from the UK Census as a reference, we document significant bias in estimated associations due to non-random selection into the UKB. For some associations, volunteer bias in the UKB is so severe that estimates have the opposite sign. E.g., older individuals in the UKB tend to be in better health. To aid researchers in correcting for volunteer bias in the UKB, we construct inverse probability weights based on UK census microdata. The use of these weights in weighted regressions reduces 78% of volunteer bias on average. Our inverse probability weights will be made available.
The implications of selection bias due to volunteering (volunteer bias) for genetic association studies are poorly understood. Because of its large sample size and extensive phenotyping, the UK Biobank (UKB) is included in almost all large genomewide association studies (GWAS) to date, as it is one of the largest cohorts. Yet, it is known to be highly selected. We develop inverse probability weighted GWAS (WGWAS) to estimate GWAS summary statistics in the UKB that are corrected for volunteer bias. WGWAS decreases the effective sample size substantially compared to GWAS by an average of 61% (from 337,543 to 130,684) depending on the phenotype. The extent to which volunteer bias affects GWAS associations and downstream results is phenotype-specific. Through WGWAS we find 11 novel genomewide significant loci for type 1 diabetes and 3 for breast cancer. These loci were not identified previously in any prior GWAS. Further, genetic variant's effect sizes and heritability estimates become more predictive in WGWAS for certain phenotypes (e.g., educational attainment, drinks per week, breast cancer and type 1 diabetes). WGWAS also alters biological annotation relations in gene-set analyses. This suggests that not accounting for volunteer-based selection can result in GWASs that suffer from bias, which in turn may drive spurious associations. GWAS consortia may therefore wish to provide population weights for their data sets or rely more on population-representative samples.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.