Recent years have seen the rise of genomic biobanks and mega-scale meta-analyses of genomic data, which promise to reveal the genetic underpinnings of health and disease. However, the over-representation of Europeans in genomic studies not only limits the global understanding of disease risk and intervention efficacy but also inhibits viable research into the genomic differences between carriers and patients. Whilst the community has agreed that more diverse samples are required, it is not enough to blindly increase diversity; the diversity must be quantified, compared and annotated to lead to insight. Genetic annotations from separate biobanks need to be comparable and computable, and must operate without access to raw data because of privacy concerns. Comparability is key both for regular research and to allow international comparison in response to pandemics. Here, we evaluate the appropriateness of the most common genomic tools used to depict population structure in a standardized and comparable manner. The end goal is to reduce the effects of confounding and learn from genuine variation in genetic effects on phenotypes across populations, which will improve the value of biobanks (locally and internationally), increase the accuracy of association analyses and inform developmental efforts.
Admixed population: A population of individuals with ancestors from two or more relatively distinct populations that mixed relatively recently in human history.
Admixture mapping: Gene mapping of susceptibility alleles for genetic disease that show differential risk by ancestry, correlating the degree of ancestry near genomic regions with greater disease risk.
Bayesian clustering: Assignment of individuals to clusters based on genetic similarity without assuming predefined populations, using statistical methods that allow inferences to be drawn from the data and prior information.
Expectation-Maximisation (EM) algorithm: An iterative method to find maximum likelihood estimates (MLE) of parameters in statistical models, alternating between an expectation step (which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters) and a maximisation step (which computes the parameters maximising the output of the expectation step).
Genetic Relatedness Matrix (GRM): The GRM represents the genomic similarities among all individuals. Each cell in the matrix measures the genotypic correlation between a pair of individuals (the rows and columns). The GRM can be used with a phenotypic distance matrix to estimate heritability without estimating the phenotypic effect of individual SNPs.
Hidden Markov Model (HMM): A statistical Markov model, that is, a randomly changing system in which future states are assumed to depend only on the current state, and in which the states are unobservable (hidden).
Markov Chain Monte Carlo (MCMC): A simulation method used in Bayesian calculations, incorporating a class of algorithms that can obtain a sample of the desired distribution by observing several steps of the Markov chain, which is a sequence of a probability of events that ...
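To make the Genetic Relatedness Matrix entry above concrete, the following is a minimal sketch of how such a matrix could be computed from allele counts. It is an illustration under stated assumptions, not the procedure of any particular tool: the function name genetic_relatedness_matrix and the toy genotypes are hypothetical, genotypes are assumed to be coded as 0, 1 or 2 copies of one allele, and each SNP is standardised with allele frequencies estimated from the sample itself, which is one common convention among several.

```python
from typing import List


def genetic_relatedness_matrix(genotypes: List[List[int]]) -> List[List[float]]:
    """Return an n x n relatedness matrix for n individuals.

    genotypes[i][s] is the number of copies (0, 1 or 2) of a chosen
    allele carried by individual i at SNP s (an assumed coding).
    """
    n = len(genotypes)
    m = len(genotypes[0])

    # Allele frequency of each SNP, estimated from the sample itself.
    freqs = [sum(ind[s] for ind in genotypes) / (2 * n) for s in range(m)]

    # Monomorphic SNPs have zero variance and cannot be standardised.
    used = [s for s in range(m) if 0.0 < freqs[s] < 1.0]
    if not used:
        raise ValueError("no polymorphic SNPs in the input")

    # Centre each genotype by its expected value 2p and scale by the
    # binomial standard deviation sqrt(2p(1 - p)).
    z = [
        [
            (ind[s] - 2 * freqs[s]) / (2 * freqs[s] * (1 - freqs[s])) ** 0.5
            for s in used
        ]
        for ind in genotypes
    ]

    # Each cell is the average product of standardised genotypes
    # for one pair of individuals.
    return [
        [sum(a * b for a, b in zip(z[i], z[j])) / len(used) for j in range(n)]
        for i in range(n)
    ]


# Toy data: individuals 0 and 1 are identical, individual 2 differs.
toy_genotypes = [
    [0, 1, 2],
    [0, 1, 2],
    [2, 1, 0],
]
toy_grm = genetic_relatedness_matrix(toy_genotypes)
for row in toy_grm:
    print([round(value, 3) for value in row])
```

In this toy example the two identical individuals receive the same positive value with each other as with themselves, while the third individual is negatively related to both. Only the resulting matrix, not the raw genotypes, is needed for downstream comparison, although whether sharing such a matrix satisfies a given privacy requirement is outside the scope of this sketch.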