Most organisms are more closely related to nearby than distant members of their species, creating spatial autocorrelations in genetic data. This allows us to predict the location of origin of a genetic sample by comparing it to a set of samples of known geographic origin. Here we describe a deep learning method, which we call Locator, to accomplish this task faster and more accurately than existing approaches. In simulations, Locator infers sample location to within 4.1 generations of dispersal and runs at least an order of magnitude faster than a recent model-based approach. We leverage Locator's computational efficiency to predict locations separately in windows across the genome, which allows us to both quantify uncertainty and describe the mosaic ancestry and patterns of geographic mixing that characterize many populations. Applied to whole-genome sequence data from Plasmodium parasites, Anopheles mosquitoes, and global human populations, this approach yields median test errors of 16.9 km, 5.7 km, and 85 km, respectively.

Introduction

In natural populations, local mate selection and dispersal create correlations between geographic location and genetic variation: each individual's genome is a mosaic of material inherited from recent ancestors that are usually geographically nearby. Given a set of genotyped individuals of known geographic provenance, it is therefore possible to predict the location of new samples from genetic information alone (Guillot et al., 2015; Yang et al., 2012; Wasser et al., 2004; Rañola et al., 2014; Bhaskar et al., 2016; Baran et al., 2013). This task has forensic applications, for example estimating the location of trafficked elephant ivory as in Wasser et al. (2004), and also offers a way to analyze variation in geographic ancestry without assuming the existence of discrete ancestral populations.
The most common approaches to estimating sample locations are based on unsupervised genotype clustering or dimensionality reduction techniques. Genetic data from samples of both known and unknown origin are jointly analyzed, and unknown samples are assigned to the location of known individuals with which they share a genotype cluster or region of PC space (Breidenbach et al., 2019; Battey et al., 2018; Cong et al., 2019). However, these methods require an additional mapping from genotype clusters or PC space to geography, and can produce nonsensical results if unknown samples are hybrids or do not originate from any of the sampled reference populations.
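As a rough illustration of the PC-space assignment described above, the following sketch runs a joint PCA on samples of known and unknown origin and gives each unknown sample the mean location of its nearest known neighbors in PC space. It is not the procedure of any of the cited studies: the function names (`pca_scores`, `predict_locations`), the choice of k-nearest-neighbor averaging, and the simulated two-population data are all assumptions made for this example.

```python
import numpy as np

def pca_scores(genotypes, n_components=2):
    """Project samples (rows) onto the top principal components."""
    centered = genotypes - genotypes.mean(axis=0)
    # SVD of the centered matrix; columns of u scaled by s are the PC scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def predict_locations(genotypes, locations, known_mask, k=3, n_components=2):
    """Assign each unknown sample the mean location of its k nearest
    known samples in PC space (hypothetical helper, not a published method)."""
    scores = pca_scores(genotypes, n_components)  # joint PCA on all samples
    known_scores = scores[known_mask]
    known_locs = locations[known_mask]
    preds = []
    for row in scores[~known_mask]:
        dist = np.linalg.norm(known_scores - row, axis=1)
        nearest = np.argsort(dist)[:k]
        preds.append(known_locs[nearest].mean(axis=0))
    return np.array(preds)

# Toy data (assumed): two populations with very different allele frequencies,
# sampled at (0, 0) and (10, 10). Genotypes are 0/1/2 alternate-allele counts.
rng = np.random.default_rng(0)
n_snps = 200
geno_a = rng.binomial(2, np.full(n_snps, 0.1), size=(11, n_snps))
geno_b = rng.binomial(2, np.full(n_snps, 0.9), size=(11, n_snps))
genotypes = np.vstack([geno_a, geno_b]).astype(float)
locations = np.vstack([np.zeros((11, 2)), np.full((11, 2), 10.0)])

# Hold out one sample per population as "unknown"; their true locations
# stay in `locations` only for checking and are never used for prediction.
known_mask = np.ones(22, dtype=bool)
known_mask[[10, 21]] = False

preds = predict_locations(genotypes, locations, known_mask)
print(preds)
```

Note that the mapping from PC space to geography is the extra step here: the neighbor averaging can only return locations inside the range of the reference samples, so a sample from an unsampled population or a hybrid would still be placed at or between reference sites.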
Battey et al. 2019 Locator

Existing methods for estimating sample location that explicitly model continuous landscapes use a two-step procedure. A smoothed map describing variation in allele frequencies over space is first estimated for each allele based on the genotypes of individuals with known locations, and locations of new samples are then predicted by maximizing the likelihood of observing a given combination of alleles at the p...