Motivation: De Bruijn graphs play an essential role in computational biology, facilitating rapid alignment-free comparison of genomic datasets as well as reconstruction of underlying genomic sequences. Consequently, an important question is how to efficiently represent, compress, and transmit de Bruijn graphs of the most common types of genomic datasets, such as sequencing reads, genomes, and pan-genomes.
Results: We introduce simplitigs, an effective representation of de Bruijn graphs for alignment-free applications. Simplitigs are a generalization of unitigs and correspond to spellings of vertex-disjoint paths in a de Bruijn graph. We present an easy-to-plug-in greedy heuristic for their computation and provide a reference implementation in a program called ProphAsm. We use ProphAsm to compare the scaling of simplitigs and unitigs on a range of genomic datasets. We demonstrate that simplitigs are superior to unitigs in terms of both the cumulative sequence length and the number of sequences, and that they are sufficiently close to the theoretical bounds for practical applications. Finally, we demonstrate that, when combined with standard full-text indexes, simplitigs provide a scalable solution for k-mer search in pan-genomes.
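The greedy idea can be illustrated with a minimal Python sketch. It is not ProphAsm's code and makes several simplifying assumptions: the graph is treated as unidirectional (reverse complements are ignored), the alphabet is fixed to ACGT, and each new path is seeded with the lexicographically smallest unused k-mer so that the output is deterministic. The function names `kmers` and `greedy_simplitigs` are hypothetical and chosen only for this illustration.

```python
def kmers(seqs, k):
    """Collect the set of all k-mers occurring in the input sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def greedy_simplitigs(kmer_set, k):
    """Greedily cover a k-mer set by vertex-disjoint paths and spell them.

    Assumption: unidirectional graph over the alphabet ACGT.
    """
    remaining = set(kmer_set)  # k-mers not yet used by any path
    simplitigs = []
    while remaining:
        # Seed a new path; min() is an arbitrary but deterministic choice.
        path = min(remaining)
        remaining.remove(path)

        # Extend forward while some unused k-mer overlaps the path's end.
        extended = True
        while extended:
            extended = False
            suffix = path[len(path) - (k - 1):]
            for c in "ACGT":
                candidate = suffix + c
                if candidate in remaining:
                    remaining.remove(candidate)
                    path += c
                    extended = True
                    break

        # Extend backward while some unused k-mer overlaps the path's start.
        extended = True
        while extended:
            extended = False
            prefix = path[:k - 1]
            for c in "ACGT":
                candidate = c + prefix
                if candidate in remaining:
                    remaining.remove(candidate)
                    path = c + path
                    extended = True
                    break

        simplitigs.append(path)
    return simplitigs


if __name__ == "__main__":
    example = kmers(["ACGTTGCA"], 3)
    print(greedy_simplitigs(example, 3))  # ['ACGTTGCA']
```

Because every k-mer is removed from `remaining` as soon as it joins a path, each k-mer appears in exactly one output string, which is the vertex-disjointness property described above. The edges are never stored; they are tested implicitly through set membership of the four possible neighbors.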
Availability: ProphAsm is written in C++ and is available under the MIT license.

De Bruijn graphs are among the most popular graph representations of genomic datasets. They are defined as directed graphs G = (V, E), where V is the set of all k-mers (i.e., subwords of a fixed length k) occurring in the dataset, with edges connecting a vertex v to a vertex w if there is a (k − 1)-long prefix-suffix overlap between v and w. As follows from the definition, we can associate a de Bruijn graph with the underlying k-mer set, and edges can be defined implicitly (unlike the edge-centric definition, where k-mer sets are associated with edges [5]). In this paper, we consider only vertex-centric graphs.

De Bruijn graphs feature remarkable properties. First, their computation from data is easy and deterministic. Algorithms for enumerating and counting k-mers have been extensively studied and many programs are available [6][7][8][9]. If the datasets contain sequencing errors, the computation may also involve graph cleaning, which aims to remove those k-mers that result from sequencing errors and are, owing to their supposed randomness, expected to be rare. Second, if k is chosen appropriately, de Bruijn graphs can capture substantial information about the entire molecules under sequencing, as these correspond to (some of the) walks in the graphs, provided that sequencing was sufficiently deep. Third, de Bruijn graphs can be handled easily, which simplifies software development as well as dataset analysis and interpretation. These properties have led to a large variety of applications of de Bruijn graphs.

De Bruijn graphs have been widely studied in the context of sequence assembly [10][11][12]. Here, their construction is typically the first step to the reconstruction of the genomes and transcr...