2021
DOI: 10.1093/g3journal/jkaa036

Visualizing population structure with variational autoencoders

Abstract: Dimensionality reduction is a common tool for visualization and inference of population structure from genotypes, but popular methods either return too many dimensions for easy plotting (PCA) or fail to preserve global geometry (t-SNE and UMAP). Here we explore the utility of variational autoencoders (VAEs)—generative machine learning models in which a pair of neural networks seek to first compress and then recreate the input data—for visualizing population genetic variation. VAEs incorporate nonlinear relatio…

citations
Cited by 71 publications
(91 citation statements)
references
References 45 publications
“…Haplotype diversity, evolutionary distances based on the Tajima-Nei model, and population differentiation (Fst) were calculated for each group of haplotypes based on plastomic SNVs by using DnaSP v6.0 (Rozas et al, 2017) and MEGA7 (Kumar et al, 2016). The variational autoencoder plotting was calculated with popvae software (Battey et al, 2021). The principal coordinates analysis (PCoA) and multidimensional scaling (MDS) were conducted in TASSEL 5 (Bradbury et al, 2007).…”
Section: Methods (mentioning)
confidence: 99%
“…Some early efforts used machine learning to account for issues that arise with high‐dimensional summary statistics (Blum & François, 2010 ; Sheehan & Song, 2016 ; Ronen et al, 2013 ). More recently, machine learning approaches have used various forms of convolutional, recurrent and ‘deep’ neural networks to improve inference and visualization (Adrion et al, 2020 ; Battey et al, 2021 ; Gower et al, 2020 ; Flagel et al, 2019 ; Sanchez et al, 2020 ; Torada et al, 2019 ; Chan et al, 2018 ). One of the goals of moving to these approaches was to enable inference frameworks to operate on the ‘raw’ data (genotype matrices), which avoids the loss of information that comes from reducing genotypes to summary statistics.…”
Section: Introduction (mentioning)
confidence: 99%
“…We see this as an inherent problem relating to data structure. Previous comparisons of t-SNE found low fidelity with global data patterns, and latent space distances were poor proxies for "true" among-group distances, particularly when compared to VAE (Battey et al, 2020; Becht et al, 2019). This potentially explains our observed "plateau" of mean optimal K and SD in the t-SNE perplexity grid-search, in that perplexity defines relative weighting of local versus global components (Wattenberg et al, 2016).…”
Section: Relative Performance Of Species-delimitation Methods (mentioning)
confidence: 78%