Visualizing population structure with variational autoencoders

Battey, C. J.; Coffing, Gabrielle C.; Kern, Andrew D.

doi:10.1093/g3journal/jkaa036

Cited by 71 publications

(91 citation statements)

References 45 publications


“…Haplotype diversity, evolutionary distances based on the Tajima-Nei model, and population differentiation ( Fst ) were calculated for each group of haplotypes based on plastomic SNVs by using DnaSP v6.0 ( Rozas et al, 2017 ) and MEGA7 ( Kumar et al, 2016 ). The variational autoencoder plotting was calculated with popvae software ( Battey et al, 2021 ). The principal coordinates analysis (PCoA) and multidimensional scaling (MDS) were conducted in TASSEL 5 ( Bradbury et al, 2007 ).…”

Section: Methods (mentioning)

confidence: 99%

The History and Diversity of Rice Domestication as Resolved From 1464 Complete Plastid Genomes

Chen, Xiang, et al. 2021

Front. Plant Sci.


The plastid is an essential organelle in autotrophic plant cells, descending from free-living cyanobacteria and acquired by early eukaryotic cells through endosymbiosis roughly one billion years ago. It contained a streamlined genome (plastome) that is uniparentally inherited and non-recombinant, which makes it an ideal tool for resolving the origin and diversity of plant species and populations. In the present study, a large dataset was amassed by de novo assembling plastomes from 295 common wild rice (Oryza rufipogon Griff.) and 1135 Asian cultivated rice (Oryza sativa L.) accessions, supplemented with 34 plastomes from other Oryza species. From this dataset, the phylogenetic relationships and biogeographic history of O. rufipogon and O. sativa were reconstructed. Our results revealed two major maternal lineages across the two species, which further diverged into nine well supported genetic clusters. Among them, the Or-wj-I/II/III and Or-wi-I/II genetic clusters were shared with cultivated (percentage for each cluster ranging 54.9%∼99.3%) and wild rice accessions. Molecular dating, phylogeographic analyses and reconstruction of population historical dynamics indicated an earlier origin of the Or-wj-I/II genetic clusters from East Asian with at least two population expansions, and later origins of other genetic clusters from multiple regions with one or more population expansions. These results supported a single origin of japonica rice (mainly in Or-wj-I/II) and multiple origins of indica rice (in all five clusters) for the history of rice domestication. The massive plastomic data set presented here provides an important resource for understanding the history and evolution of rice domestication as well as a genomic resources for use in future breeding and conservation efforts.

“…Some early efforts used machine learning to account for issues that arise with high‐dimensional summary statistics (Blum & François, 2010 ; Sheehan & Song, 2016 ; Ronen et al, 2013 ). More recently, machine learning approaches have used various forms of convolutional, recurrent and ‘deep’ neural networks to improve inference and visualization (Adrion et al, 2020 ; Battey et al, 2021 ; Gower et al, 2020 ; Flagel et al, 2019 ; Sanchez et al, 2020 ; Torada et al, 2019 ; Chan et al, 2018 ). One of the goals of moving to these approaches was to enable inference frameworks to operate on the ‘raw’ data (genotype matrices), which avoids the loss of information that comes from reducing genotypes to summary statistics.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic inference of demographic parameters using generative adversarial networks

Wang, Kourakos, et al. 2021

Molecular Ecology Resources


Simulation is a key component of population genetics. It helps to train our intuition and is important for the development, testing and comparison of inference methods. Because population genetic models such as the ancestral recombination and selection graphs (Griffiths & Marjoram, 1997; Neuhauser & Krone, 1997) are computationally intractable for inference but relatively easy to simulate, simulations are also heavily used for parameter inference. Approximate Bayesian Computation (ABC; Beaumont et al., 2002) is a widely used example. Regardless of the application, the goal is to simulate data that is 'realistic' in the sense that it resembles real data from the population(s) of interest. Typically this is done by fixing some parameters that are fairly well-known, then choosing other parameters to match some property of the real data, usually based on summary statistics. However, this involves a potential loss of information in the reduction in summary statistics and then an implicit weighting on the relative importance of different summary statistics. Often, parameters that create simulations that match one type of summary statistic (e.g. the site frequency spectrum) do not match others (e.g. linkage disequilibrium patterns; Beichman et al., 2017). Here, we present a novel parameter learning approach using Generative Adversarial Networks (GANs). Our approach creates both realistic simulated data and a quantitative way of determining


“…We see this as an inherent problem relating to data structure. Previous comparisons of t-SNE found low fidelity with global data patterns, and latent space distances were poor proxies for "true" among-group distances, particularly when compared to VAE (Battey et al, 2020; Becht et al, 2019). This potentially explains our observed "plateau" of mean optimal K and SD in the t-SNE perplexity grid-search, in that perplexity defines relative weighting of local versus global components (Wattenberg et al, 2016).…”

Section: Relative Performance Of Species-delimitation Methods (mentioning)

confidence: 78%

The choices we make and the impacts they have: Machine learning and species delimitation in North American box turtles (Terrapene spp.)

Martin, Chafin, Douglas, et al. 2021

Molecular Ecology Resources


Model-based approaches that attempt to delimit species are hampered by computational limitations as well as the unfortunate tendency by users to disregard algorithmic assumptions. Alternatives are clearly needed, and machine-learning (M-L) is attractive in this regard as it functions without the need to explicitly define a species concept. Unfortunately, its performance will vary according to which (of several) bioinformatic parameters are invoked. Herein, we gauge the effectiveness of M-L-based species-delimitation algorithms by parsing 64 variably-filtered versions of a ddRAD-derived SNP data set collected from North American box turtles (Terrapene spp.). Our filtering strategies included: (i) minor allele frequencies (MAF) of 5%, 3%, 1%, and 0% (= none), and (ii) maximum missing data per-individual/per-population at 25%, 50%, 75%, and 100% (= no filtering). We found that species-delimitation via unsupervised M-L impacted the signal-to-noise ratio in our data, as well as the discordance among resolved clades. The latter may also reflect biogeographic history, gene flow, incomplete lineage sorting, or combinations thereof (as corroborated from previously observed patterns of differential introgression). Our results substantiate M-L as a viable species-delimitation method, but also demonstrate how commonly observed patterns of phylogenetic discordance can seriously impact M-L-classification.
