XGMix: Local-Ancestry Inference with Stacked XGBoost

Kumar, Arvind; Montserrat, Daniel Mas; Bustamante, Carlos D.; Ioannidis, Alexander

doi:10.1101/2020.04.21.053876

Cited by 8 publications

(8 citation statements)

References 20 publications


“…Other purely discriminative approaches have recently been developed. Kumar et al (2020) have described an approach that employs boosted gradient trees to perform local ancestry inference much faster and with fewer computational resources than existing methods, while maintaining comparable accuracy. A similar method, using neural networks has also been described recently (Montserrat et al, 2020).…”

Section: Discussion (mentioning)

confidence: 99%

“…Instead, they attempt to learn directly from segments of known ancestry the conditional distribution of ancestries given haplotype data. Discriminative models make fewer assumptions about the demographic process underlying admixture and typically scale better to large datasets (Omberg et al, 2012; Kumar et al, 2020). A number of discriminative approaches have been described (Brisbin et al, 2012; Omberg et al, 2012; Maples et al, 2013; Kumar et al, 2020; Montserrat et al, 2020).…”

Section: Introduction (mentioning)

confidence: 99%


A scalable pipeline for local ancestry inference using tens of thousands of reference haplotypes

Durand, Chuong, Wilton, et al. 2021. Preprint.

Ancestry deconvolution is the task of identifying the ancestral origins of chromosomal segments of admixed individuals. It has important applications, from mapping disease genes to identifying loci potentially under natural selection. However, most existing methods are limited to a small number of ancestral populations and are unsuitable for large-scale applications. In this article, we describe Ancestry Composition, a modular pipeline for accurate and efficient ancestry deconvolution. In the first stage, a string-kernel support-vector-machines classifier assigns provisional ancestry labels to short statistically phased genomic segments. In the second stage, an autoregressive pair hidden Markov model corrects phasing errors, smooths local ancestry estimates, and computes confidence scores. Using publicly available datasets and more than 12,000 individuals from the customer database of the personal genetics company, 23andMe, Inc., we have constructed a reference panel containing more than 14,000 unrelated individuals of unadmixed ancestry. We used principal components analysis (PCA) and uniform manifold approximation and projection (UMAP) to identify genetic clusters and define 45 distinct reference populations upon which to train our method. In cross-validation experiments, Ancestry Composition achieves high precision and recall.
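The two-stage idea in this abstract (a classifier proposes per-segment labels, then a sequence model smooths them) can be illustrated with a minimal sketch. This is not the autoregressive pair HMM described by the authors: it is a plain Viterbi decoder over per-window class probabilities, and the function name `viterbi_smooth` and the `stay_prob` parameter are hypothetical choices made for illustration.

```python
import math


def _safe_log(p):
    # Floor probabilities to avoid log(0).
    return math.log(max(p, 1e-12))


def viterbi_smooth(window_probs, stay_prob=0.9):
    """Return the most likely ancestry index per window.

    window_probs[t][k] is the (assumed) classifier probability that
    window t has ancestry k. Transitions favour staying in the same
    ancestry with probability stay_prob; the remainder is split evenly
    among the other ancestries.
    """
    if not window_probs:
        return []
    k = len(window_probs[0])
    switch_prob = (1.0 - stay_prob) / max(k - 1, 1)
    log_stay, log_switch = _safe_log(stay_prob), _safe_log(switch_prob)

    score = [_safe_log(p) for p in window_probs[0]]
    backpointers = []
    for probs in window_probs[1:]:
        new_score, pointers = [], []
        for j in range(k):
            # Best previous state for landing in state j.
            best_i = max(
                range(k),
                key=lambda i: score[i] + (log_stay if i == j else log_switch),
            )
            trans = log_stay if best_i == j else log_switch
            new_score.append(score[best_i] + trans + _safe_log(probs[j]))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


noisy = [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.9, 0.1]]
smoothed = viterbi_smooth(noisy, stay_prob=0.9)
```

In the toy input, the third window weakly favours the second ancestry, but the smoothing step overrides that isolated call because two ancestry switches are less plausible than one slightly unlikely emission.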


“…Genome sequences are composed of four nucleotides, typically represented with the letters: A, T, C and G. While the majority of genomic positions are fixed across individuals of the same species, a small fraction is known to be variable. Most of these positions are single-nucleotide polymorphisms (SNPs) that have two variants or forms, which allows for a binary encoding with a common or majority variant (encoded as a zero) shared among the majority of individuals and a minority or alternative variant (encoded as a one) (Avallone et al, 2020; Ioannidis et al, 2020; Kumar et al, 2020; Maples et al, 2013; Thornton and Bermejo, 2014).…”

Section: Genomic Data and Its Applications (mentioning)

confidence: 99%
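The binary SNP encoding described in the statement above can be sketched as follows. This is only an illustration under simple assumptions: each haplotype is a string of nucleotides over the same variable sites, the majority variant is determined from the sample itself, and ties are broken alphabetically. The function name `encode_biallelic` is hypothetical.

```python
from collections import Counter


def encode_biallelic(haplotypes):
    """Encode nucleotide strings as 0/1 lists, one value per site.

    0 marks the majority variant at a site, 1 marks any other variant.
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must cover the same sites")

    encoded = [[] for _ in haplotypes]
    for site in range(n_sites):
        column = [h[site] for h in haplotypes]
        counts = Counter(column)
        # Most frequent allele; ties resolved alphabetically for determinism.
        majority = min(counts, key=lambda allele: (-counts[allele], allele))
        for row, allele in zip(encoded, column):
            row.append(0 if allele == majority else 1)
    return encoded


encoded = encode_biallelic(["ACGT", "ACGA", "TCGA"])
```

Here the first site has majority `A` and the last site has majority `A`, so only the third haplotype carries a 1 at the first site and only the first haplotype carries a 1 at the last site.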

SALAI-Net: species-agnostic local ancestry inference network

et al. 2022


Motivation: Local ancestry inference (LAI) is the high resolution prediction of ancestry labels along a DNA sequence. LAI is important in the study of human history and migrations, and it is beginning to play a role in precision medicine applications including ancestry-adjusted genome-wide association studies (GWASs) and polygenic risk scores (PRSs). Existing LAI models do not generalize well between species, chromosomes or even ancestry groups, requiring re-training for each different setting. Furthermore, such methods can lack interpretability, which is an important element in each of these applications. Results: We present SALAI-Net, a portable statistical LAI method that can be applied on any set of species and ancestries (species-agnostic), requiring only haplotype data and no other biological parameters. Inspired by identity by descent methods, SALAI-Net estimates population labels for each segment of DNA by performing a reference matching approach, which leads to an interpretable and fast technique. We benchmark our models on whole-genome data of humans and we test these models’ ability to generalize to dog breeds when trained on human data. SALAI-Net outperforms previous methods in terms of balanced accuracy, while generalizing between different settings, species and datasets. Moreover, it is up to two orders of magnitude faster and uses considerably less RAM memory than competing methods. Availability and implementation: We provide an open source implementation and links to publicly available data at github.com/AI-sandbox/SALAI-Net. Data is publicly available as follows: https://www.internationalgenome.org (1000 Genomes), https://www.simonsfoundation.org/simons-genome-diversity-project (Simons Genome Diversity Project), https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html (HapMap), ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516 (Human Genome Diversity Project) and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA448733 (Canid genomes). Supplementary information: Supplementary data are available from Bioinformatics online.


“…Chm-22 and Chm-1 include the same set of individuals, but with only the subset of their genome sequence encoded on chromosome 22 and chromosome 1, respectively, considered. Chm-22-SIM is an augmented version of the Chm-22 data: it contains simulated descendants of the real individuals, created using a recombination simulation program, PyAdmix [23] with the simulations performed independently on the train and validation partitions of Chm-22. A total of 400 individuals per ancestry are generated in the training set and 50 in the validation set.…”

Section: Experiments Datasets (mentioning)

confidence: 99%
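The idea of simulating descendants by recombination, mentioned in the statement above, can be sketched in a few lines. This is not PyAdmix and does not reproduce its algorithm or interface: it is a toy model that cuts a chromosome at random breakpoints and copies each segment from a randomly chosen founder, recording the ancestry label of every site. The function name `simulate_admixed_haplotype` and all parameters are hypothetical.

```python
import random


def simulate_admixed_haplotype(founders, labels, n_breakpoints, rng):
    """Build a mosaic haplotype from founder haplotypes.

    founders: list of equal-length allele lists.
    labels: ancestry label of each founder.
    n_breakpoints: number of recombination breakpoints to place.
    rng: a random.Random instance, passed in for reproducibility.
    Returns (haplotype, per_site_ancestry).
    """
    if len(founders) != len(labels) or not founders:
        raise ValueError("need one label per founder")
    n_sites = len(founders[0])
    if any(len(f) != n_sites for f in founders):
        raise ValueError("founders must cover the same sites")
    if not 0 <= n_breakpoints <= max(n_sites - 1, 0):
        raise ValueError("too many breakpoints for this many sites")

    # Breakpoints fall strictly inside the chromosome.
    cuts = sorted(rng.sample(range(1, n_sites), n_breakpoints))
    bounds = [0] + cuts + [n_sites]

    haplotype, ancestry = [], []
    for start, end in zip(bounds[:-1], bounds[1:]):
        source = rng.randrange(len(founders))
        haplotype.extend(founders[source][start:end])
        ancestry.extend([labels[source]] * (end - start))
    return haplotype, ancestry


founders = [[0] * 10, [1] * 10]
labels = ["AFR", "EUR"]
hap, anc = simulate_admixed_haplotype(founders, labels, 3, random.Random(0))
```

Running the simulation separately on the training and validation founders, as the statement describes, keeps simulated individuals in one partition from sharing founders with the other.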

Neural ADMIXTURE: rapid population clustering with autoencoders

Bustamante et al. 2021. Preprint. Self-citation.

Characterizing the genetic substructure of large cohorts has become increasingly important as genetic association and prediction studies are extended to massive, increasingly diverse, biobanks. ADMIXTURE and STRUCTURE are widely used unsupervised clustering algorithms for characterizing such ancestral genetic structure. These methods decompose individual genomes into fractional cluster assignments with each cluster representing a vector of DNA marker frequencies. The assignments, and clusters, provide an interpretable representation for geneticists to describe population substructure at the sample level. However, with the rapidly increasing size of population biobanks and the growing numbers of variants genotyped (or sequenced) per sample, such traditional methods become computationally intractable. Furthermore, multiple runs with different hyperparameters are required to properly depict the population clustering using these traditional methods, increasing the computational burden. This can lead to days of compute. In this work we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as ADMIXTURE, providing similar (or better) clustering, while reducing the compute time by orders of magnitude. In addition, this network can include multiple outputs, providing the equivalent results as running the original ADMIXTURE algorithm many times with different numbers of clusters. These models can also be stored, allowing later cluster assignment to be performed with a linear computational time.
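The decomposition the abstract describes, fractional cluster assignments combined with per-cluster marker frequencies, can be illustrated with a small sketch. It assumes the standard ADMIXTURE-style formulation in which an individual's expected allele frequency at each marker is the assignment-weighted average of the cluster frequencies. The names `expected_allele_frequencies`, `Q`, and `F` are illustrative and not taken from the paper's code.

```python
def expected_allele_frequencies(Q, F):
    """Combine cluster assignments with cluster allele frequencies.

    Q[i][k]: fraction of individual i assigned to cluster k (rows sum to 1).
    F[k][j]: frequency of the alternative allele at marker j in cluster k.
    Returns P with P[i][j] = sum over k of Q[i][k] * F[k][j].
    """
    n_clusters = len(F)
    n_markers = len(F[0]) if F else 0
    result = []
    for assignments in Q:
        if len(assignments) != n_clusters:
            raise ValueError("each row of Q needs one entry per cluster")
        row = [
            sum(assignments[k] * F[k][j] for k in range(n_clusters))
            for j in range(n_markers)
        ]
        result.append(row)
    return result


# One individual split evenly between two clusters, two markers.
P = expected_allele_frequencies([[0.5, 0.5]], [[0.2, 0.8], [0.6, 0.4]])
```

An individual split evenly between the two clusters gets expected frequencies halfway between the cluster values, roughly 0.4 and 0.6 in this toy case.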
