Motivation: The advancement of SMRT technology has opened new opportunities for genome analysis with its longer read lengths and low GC bias. Aligning the reads to their appropriate positions in the reference genome is the first, but also the costliest, step of any analysis pipeline based on SMRT sequencing. However, state-of-the-art aligners often fail to identify distant homologies because of a lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate.
Results: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the state-of-the-art alignment-based methods. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as a SAM file, if this is required for any downstream processing.

Single molecule real-time (SMRT) sequencing, developed by Pacific Biosciences [27], and the sequencing platform of Oxford Nanopore Technologies [21] have started to replace the earlier short-read next-generation sequencing (NGS) technologies. These new technologies have enabled us to address many unsolved problems regarding genetic variations. With the increase in read length to around 20 kb [2], SMRT reads can be used to resolve ambiguities in read mapping caused by repetitive regions. Low GC bias and the ability to detect DNA methylation [27] from native DNA have made SMRT data appealing for many real-life applications. However, the high sequencing error rate of 13-15% per base [2] poses a real challenge in sequence analysis. Specialized methods like BWA-MEM [15], BLASR [6], rHAT [20], Minimap2 [17], lordFAST [9], etc., have been designed to align noisy long reads back to the respective reference genomes. BLASR [6] clusters the matched words from the reads and genome after indexing using suffix arrays or BWT-FM [28]. It uses a probability-based error optimization technique to find the alignment.
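
The general idea of matching short words between a read and the genome and then clustering the matches can be illustrated with a minimal sketch. This is not the BLASR implementation: it assumes a plain dictionary index in place of a suffix array or BWT-FM index, and the function names, the word length `k = 4`, the `band` width and the toy sequences are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_matches(read, index, k):
    """Return (read_pos, ref_pos) pairs for all exact k-mer matches."""
    hits = []
    for read_pos in range(len(read) - k + 1):
        # .get() avoids inserting empty entries into the defaultdict
        for ref_pos in index.get(read[read_pos:read_pos + k], ()):
            hits.append((read_pos, ref_pos))
    return hits


def best_cluster(hits, band=8):
    """Group hits by banded diagonal (ref_pos - read_pos); return the densest group."""
    clusters = defaultdict(list)
    for read_pos, ref_pos in hits:
        clusters[(ref_pos - read_pos) // band].append((read_pos, ref_pos))
    if not clusters:
        return []
    return max(clusters.values(), key=len)


# Toy example (assumed data): a 16-base read taken from reference
# position 10, with one substitution at read position 8.
reference = "ACGTTGCATGCCGATTACAGGCTTAACGGATCCGTA"
read = "CCGATTACTGGCTTAA"
k = 4

index = build_kmer_index(reference, k)
cluster = best_cluster(seed_matches(read, index, k))
diagonals = sorted({ref_pos - read_pos for read_pos, ref_pos in cluster})
print(len(cluster), "seed hits on diagonal(s)", diagonals)
```

Hits that share a diagonal are consistent with one ungapped placement of the read, so the densest band gives a candidate mapping location that a later extension step could refine. The substitution removes the four words that overlap it, but the remaining seeds still agree on the same location.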
BWA-MEM [15], originally designed for short-read mapping, has been extended for PacBio and Oxford Nanopore reads (with the options -x pacbio and -x ont2d, respectively) by efficient seeding and chaining of short exact matches. However, both methods are too slow to achieve a desired level of sensitivity [20]. This issue was addressed by rHAT [20] using a regional hash table, where windows from the reference genome with the highest k-mer matches are chosen as candidate sites for further extension using a directed acyclic graph. Unfortunately, this method has a large memory footprint if used with the default word length of k = 13, and it fails to accom...