Family trees have vast applications in multiple fields from genetics to anthropology and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. Here, we collected 86 million profiles from publicly-available online data from genealogy enthusiasts. After extensive cleaning and validation, we obtained population-scale family trees, including a single pedigree of 13 million individuals. We leveraged the data to partition the genetic architecture of longevity by inspecting millions of relative pairs and to provide insights to population genetics theories on the dispersion of families. We also report a simple digital procedure to overlay other datasets with our resource in order to empower studies with population-scale genealogical data.
One Sentence SummaryUsing massive crowd-sourced genealogy data, we created a population-scale family tree resource for scientific studies.. CC-BY-NC 4.0 International license peer-reviewed) is the author/funder. It is made available under a The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/106427 doi: bioRxiv preprint first posted online Feb. 7, 2017; 2
Main TextFamily trees are mathematical graph structures that capture two fundamental processes: mating and parenthood. As such, the edges of the trees represent potential transmission lines for a wide variety of genetic, cultural, socio-demographic, and economic factors in the human population. The foundation of quantitative genetics is built on dissecting the interplay of these factors by overlaying data on family trees and analyzing the correlation of various classes of relatives (1-3). In addition, family trees can serve as a multiplier for genetic information by using study designs that leverage genotype or phenotype data from relatives (4-7), analyzing parent-of-origin effects (8), refining heritability measures (9, 10), or improving individual risk assessment (11)(12)(13). Beyond classical genetic applications, large-scale family trees have played an important role in a wide array of disciplines including human evolution (14,15), anthropology (16), and economics (17,18).Despite the range of applications, constructing population-scale family trees has been a labor-intensive process. So far, the leading approach has mainly relied on local data record repositories such as churches or vital record offices (19)(20)(21). While playing an important role in previous studies, this approach has several limitations (22,23): first, it requires non-trivial resources to digitize the records and organize the data. Second, the resulting trees are usually limited in scope to the specific geographical area of the repository, which precludes comparisons between regions. Finally, vital record data is usually subject to various privacy protections and regulations. This reduces its accessibility and complicates its fusion with other sources such of information such as genomic or health data.Her...