Efficient storage and regression computation for population-scale genome sequencing studies

Rivas, Manuel A.; Chang, Christopher

doi:10.1101/2024.04.11.589062

Cited by 2 publications

References 21 publications

Enabling efficient analysis of biobank-scale data with genotype representation graphs

DeHaas, Pan, Wei

2024

Nat Comput Sci

Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data

DeHaas, Pan, Wei

2024

Preprint


Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data in tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes exceeds 350 terabytes (TB) in Variant Call Format (VCF), too large to fit on hard drives in uncompressed form. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure that losslessly represents phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant sharing across samples, leveraging ideas inspired by Ancestral Recombination Graphs. Capturing variant sharing in a graph format compresses biobank-scale data to the point where it can fit in a typical server's RAM (5-26 GB per chromosome) and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable from both C++ and Python for constructing and processing GRG files, and the software scales to a million whole genomes. Encoding the information in 200,000 UK Biobank phased whole genomes as a GRG takes 160 GB of disk space, more than 2,000 times smaller than the VCF. Moreover, the size of a GRG increases sublinearly with the number of samples stored, making it a sustainable solution to the increasing number of samples in large datasets. We show that summaries of genetic variants can be computed on a GRG via graph traversal that runs 230 times faster than on VCF. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.
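The idea of reusing computed values during graph traversal can be illustrated with a toy example. The sketch below is not the GRG file format or the authors' library API. The graph layout, the node and mutation names, and the helpers `samples_below` and `allele_counts` are all hypothetical, chosen only to show how attaching each mutation to a single node lets per-node results be computed once and shared.

```python
from functools import lru_cache

# Toy hierarchical graph (hypothetical layout, NOT the real GRG format):
# node id -> list of child node ids. Nodes without children are haploid samples.
CHILDREN = {
    "root": ["a", "b"],
    "a": ["s0", "s1"],
    "b": ["s2", "c"],
    "c": ["s3", "s4"],
    "s0": [], "s1": [], "s2": [], "s3": [], "s4": [],
}

# Assumption of this sketch: each mutation is attached to exactly one node,
# and every sample reachable below that node carries the mutation.
MUTATION_NODE = {
    "m1": "root",
    "m2": "a",
    "m3": "b",
    "m4": "c",
    "m5": "s2",
}


@lru_cache(maxsize=None)
def samples_below(node):
    """Return the set of sample nodes reachable from `node`.

    The cache means each node is expanded once, however many
    mutations or parent nodes ask for it.
    """
    kids = CHILDREN[node]
    if not kids:
        return frozenset([node])
    reachable = frozenset()
    for kid in kids:
        reachable |= samples_below(kid)
    return reachable


def allele_counts():
    """Count carriers of each mutation via the cached traversal."""
    return {mut: len(samples_below(node)) for mut, node in MUTATION_NODE.items()}


def tabular_genotypes():
    """Expand the graph into the sample-by-mutation table it stands for."""
    samples = sorted(samples_below("root"))
    return {
        s: {mut: int(s in samples_below(node)) for mut, node in MUTATION_NODE.items()}
        for s in samples
    }


if __name__ == "__main__":
    print(allele_counts())
```

In this toy graph the table has 5 x 5 = 25 cells, while the graph stores 8 edges and 5 mutation attachments. The savings reported in the abstract come from the same kind of sharing at a far larger scale.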
