The scientific and medical community is reaching an era of inexpensive whole genome sequencing, opening the possibility of precision medicine for millions of individuals. Here we present tiling: a flexible representation of whole genome sequences that supports simple and consistent names, annotation, queries, machine learning, and clinical screening. We partitioned the genome into 10,655,006 tiles: overlapping, variable-length sequences that begin and end with unique 24-base tags. We tiled 680 public whole genome sequences from the 1000 Genomes Project Consortium (1KG) and the Harvard Personal Genome Project (PGP) and annotated them using ClinVar database information. These genomes cover 14.13 billion tile sequences (4.087 trillion high-quality bases and 0.4321 trillion low-quality bases) and 251 phenotypes spanning ICD-9 code ranges 140-289, 320-629, and 680-759. We used these data to build a Global Alliance for Genomics and Health Beacon and a graph database. We performed principal component analysis (PCA) on the 680 public whole genomes, and by projecting the tiled genomes onto their first two principal components, we replicated the 1KG principal component separation by population ethnicity codes. Interestingly, we found that the PGP self-reported ethnicities cluster consistently with the 1KG ethnicity codes. We built a set of support-vector ABO blood-type classifiers using 75 PGP participants who had both a whole genome sequence and a self-reported blood type. Our classifier predicts A antigen presence to within 1% of the current state of the art for in silico A antigen prediction. Finally, we found six PGP participants with previously undiscovered pathogenic BRCA variants and, using our tiling, gave them simple, consistent names that can be easily and independently re-derived.
Given the near-future requirements of genomics research and precision medicine, we propose the adoption of tiling and invite all interested individuals and groups to view, rerun, copy, and modify these analyses at https://curover.se/su92l-j7d0g-swtofxa2rct8495.
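To make the tile definition above concrete, the following is a minimal sketch of cutting a sequence into overlapping, variable-length tiles that begin and end with tags. It is an illustration under stated assumptions, not the pipeline used for the analyses: the function name `tile_sequence`, the idea of supplying tag start positions as a list, and the toy sequence are all hypothetical. Only the 24-base tag length comes from the text, and the example call shortens it to 4 bases for readability.

```python
from typing import List

TAG_LEN = 24  # tag length stated in the text


def tile_sequence(seq: str, tag_starts: List[int], tag_len: int = TAG_LEN) -> List[str]:
    """Cut `seq` into overlapping tiles bounded by tags.

    Hypothetical helper: `tag_starts` holds the start offset of each tag
    in `seq`. Each tile runs from the start of one tag to the end of the
    next, so neighbouring tiles share exactly one tag.
    """
    starts = sorted(tag_starts)
    # Tags must fit inside the sequence and must not overlap each other.
    if starts and (starts[0] < 0 or starts[-1] + tag_len > len(seq)):
        raise ValueError("tag lies outside the sequence")
    for left, right in zip(starts, starts[1:]):
        if right - left < tag_len:
            raise ValueError("tags overlap")
    # One tile per consecutive pair of tags, including both bounding tags.
    return [seq[left:right + tag_len] for left, right in zip(starts, starts[1:])]


if __name__ == "__main__":
    # Toy sequence with 4-base "tags" at offsets 0, 8, and 16.
    toy = "AAAACCGGTTTTACGTGGGG"
    for tile in tile_sequence(toy, [0, 8, 16], tag_len=4):
        print(tile)
```

Because each tile includes both of its bounding tags, adjacent tiles overlap by one tag length, which is what lets a tile be located by its tags regardless of how its interior varies between genomes.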
RESULTS
All results described here may be found, replicated, and rerun on different data using Arvados at