Background: Graph-based representation of genome assemblies has recently been used in a range of applications, from gene finding to haplotype separation. While most of these applications rely on aligning molecular sequences to assembly graphs, existing software tools for finding such alignments have important limitations.
Results: We present SPAligner, a novel tool for aligning long diverged molecular sequences to assembly graphs, and demonstrate that it is an efficient solution for mapping third-generation sequencing data and can also facilitate the identification of known genes in complex metagenomic datasets.
Conclusions: Our work will help accelerate the development of graph-based approaches to the problem of aligning sequences to genome assemblies. SPAligner is implemented as part of the SPAdes tools library and is available at https://github.com/ablab/spades/archive/spaligner-paper.zip.
Background
Many popular short-read assemblers [1, 2, 3] provide the user not only with a set of contig sequences, but also with assembly graphs that encode information on the potential adjacencies of the assembled sequences. The naturally arising problem of sequence-to-graph alignment has been the topic of many recent studies [4,5,6,7,8]. Identifying alignments of long error-prone reads (such as PacBio and ONT reads) to assembly graphs is particularly important and has recently been applied to hybrid genome assembly [9,10], read error correction [11], and haplotype separation [12]. At the same time, the choice of practical aligners that support long nucleotide sequences is currently limited to vg [4] and GraphAligner [13], both of which are under active development. Moreover, to the best of our knowledge, no existing graph-based aligner supports alignment of amino acid sequences.
Here we present SPAligner (Saint Petersburg Aligner), a tool for aligning long diverged molecular sequences (both nucleotide and amino acid) against assembly graphs produced by popular short-read assemblers. The project stemmed from our previous efforts on long-read alignment within the hybridSPAdes assembler [9]. Our benchmarks on