Whole genome sequence (WGS) data could transform our ability to attribute individuals to source populations. However, methods that effectively mine these data have yet to be developed. We present a minimal multilocus distance (MMD) method that rapidly handles these large data sets, together with methods for optimally selecting loci. We applied it to WGS data to determine the source of human campylobacteriosis and the geographical origin of diverse biological species, including humans, and to proteomic data to classify breast cancer tumours. The MMD method provides highly accurate attribution and is computationally efficient for extended genotypes. These methods are generic, easy to implement for WGS and proteomic data, and widely applicable.

ADMIXTURE was originally proposed as a method for unsupervised, model-based estimation of the ancestry of unrelated individuals 40. This unsupervised version remains the most widely used, but ADMIXTURE was later extended 48 to supervised learning, so that it can use prior knowledge of the population of origin of some individuals to infer the ancestry of others. The supervised version, however, was not designed to estimate the probability that an individual was sampled from a given source; that is, it was designed to infer ancestry rather than to attribute individuals to sources. Even so, one would expect some relationship between the ancestry and the source of an individual, so it makes sense to explore the capability of ADMIXTURE as an attribution method (with applicability restricted to datasets consisting of SNP genotypes). GLOBETROTTER, another package for inferring the ancestry of individuals, also has potential as a method for source attribution with extended SNP datasets 49.

Besides developing efficient methods for source attribution, selecting loci with high discriminatory power can also help to meet the computational challenge posed by extended genotypes.
Several methods have been proposed to rank markers according to their importance for source attribution, based on the intuitive idea that highly polymorphic markers should allow for greater genetic differentiation 50. One way to do this is to measure the importance of each locus with a diversity index (e.g. expected heterozygosity, the fixation index F_ST or informativeness 5, 7, 21, 51, 52). Other approaches focus on the joint performance of sets of loci rather than on the performance of each locus individually 53-55. One would expect these approaches to be more appropriate than diversity-based methods when markers are correlated (i.e. when linkage disequilibrium is important 56). However, they are computationally intensive, impractical for extended genotypes, and do not always improve on diversity-based methods 7.

Here, we address two of the challenges posed by extended genotypes for source attribution. First, we propose a fast method for source attribution which can deal with genotypes comprising thousands of loci with minimal computational effort. S...