The practical use of graph-based reference genomes depends on the ability to align reads to them. Performing substring queries on paths through these graphs lies at the core of this task. The combination of increasing pattern length and encoded variations inevitably leads to a combinatorial explosion of the search space. We propose CHOP, a method that uses haplotype information to prevent this from happening. We show that CHOP can be applied to large and complex datasets by applying it to a graph-based representation of the human genome encoding all 80 million variants reported by the 1000 Genomes project.
Pangenomes and their graphical representations have become widespread in the domain of sequencing analysis [1]. Part of this adoption is driven by the increased characterization of within-species genomic diversity. For instance, recent versions of the human reference genome (GRCh37 and later) include sequences that represent highly polymorphic regions in the human population [2].
A pangenome can be constructed by integrating known variants into the linear reference genome. This way, a pangenome can incorporate sequence diversity in ways that a typical linear reference genome cannot. For example, aligning reads to a linear reference genome can lead to an over-representation of the reference allele. This effect, known as reference allele bias, influences highly polymorphic regions and/or regions that are absent from the reference [3,4]. By integrating variants into the alignment process, this bias can be reduced [5,6,7]. As a consequence, variant calling can be improved, with fewer erroneous variants induced by misalignments around indels, and fewer missed variants [8]. Graph data structures, often referred to as population graphs, are an intuitive representation for pangenomes [1,9]. Population graphs can be understood as compressed representations of multiple genomes, with sequence generally represented on the nodes. These nodes are in turn connected by directed edges, such that the full sequence of any genome used to construct the graph can be determined by a specific path traversal through the graph. Alternatively, an arbitrary traversal through the graph will yield a mixture of genomes.
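
The distinction between haplotype paths and arbitrary traversals can be illustrated with a minimal sketch. The toy graph below, including its node identifiers, sequences, and haplotype names, is entirely hypothetical and is not taken from any dataset discussed here; it only shows that following a recorded path reproduces an input genome, whereas enumerating all traversals also produces mixtures.

```python
# Hypothetical toy population graph: node id -> sequence label.
nodes = {1: "ACG", 2: "T", 3: "G", 4: "TTA", 5: "C", 6: "A", 7: "GG"}

# Directed edges: node id -> list of successor node ids.
edges = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7]}

# Two assumed input genomes, each recorded as a path through the graph.
haplotypes = {
    "hap_a": [1, 2, 4, 5, 7],
    "hap_b": [1, 3, 4, 6, 7],
}


def spell(path):
    """Concatenate node sequences along a path, checking each edge exists."""
    for src, dst in zip(path, path[1:]):
        if dst not in edges.get(src, []):
            raise ValueError(f"no edge from {src} to {dst}")
    return "".join(nodes[n] for n in path)


def enumerate_paths(start, end):
    """List every path from start to end (exponential in the worst case)."""
    if start == end:
        return [[end]]
    return [
        [start] + rest
        for nxt in edges.get(start, [])
        for rest in enumerate_paths(nxt, end)
    ]


haplotype_sequences = {spell(p) for p in haplotypes.values()}
all_sequences = {spell(p) for p in enumerate_paths(1, 7)}

# Sequences that can be traversed but match no input genome are mixtures.
mixtures = all_sequences - haplotype_sequences
```

In this toy case, two variant sites already give four traversals for two input genomes. Each additional independent site doubles the number of traversals, which is the growth in search space that haplotype information is meant to constrain.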
A key application for reference genomes is read alignment. Most linear-reference read aligners follow a seed-and-extend paradigm, wherein exact matching substrings (seeds) between the read and a reference are used to constrain a local alignment. To efficiently search for exactly matching substrings (seeding), indexing data structures are used. The construction of these indexes generally relies on one of two methods: k-mer-based indexing, where all substrings of length k are stored in a hash map along with their positions within the sequence; and sorting-based methods such as the Burrows-Wheeler Transform (BWT), where the reference sequence is transformed into a self-index that supports the lookup of exact-matching substrings of arbitrary length.
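
Both indexing strategies can be sketched in a few lines for a linear reference. The function names below (`build_kmer_index`, `find_seeds`, `bwt`) are illustrative choices, not the interface of any existing aligner. The BWT is built naively by sorting all rotations, which is only practical for very short strings; production tools derive it from a suffix array and add auxiliary tables to support backward search.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_seeds(read, index, k):
    """Return (read_offset, reference_position) pairs for exact k-mer hits."""
    seeds = []
    for j in range(len(read) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for pos in index.get(read[j:j + k], []):
            seeds.append((j, pos))
    return seeds


def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform via sorted rotations.

    Assumes the sentinel does not occur in the text and sorts before
    every other character, as '$' does for upper- and lower-case letters.
    """
    text += sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


reference = "ACGTACGT"
index = build_kmer_index(reference, k=3)
seeds = find_seeds("CGTA", index, k=3)
```

The k-mer index answers queries of exactly length k, so k has to be fixed at construction time. A BWT-based self-index does not have this restriction, at the cost of a more involved construction.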
25Existi...