Angana Chakraborty scite author profile

2013

Sci Rep

In this article we propose a Fast Optimal Global Sequence Alignment Algorithm, FOGSAA, which aligns a pair of nucleotide/protein sequences faster than any optimal global alignment method including the widely used Needleman-Wunsch (NW) algorithm. FOGSAA is applicable for all types of sequences, with any scoring scheme, and with or without affine gap penalty. Compared to NW, FOGSAA achieves a time gain of (70–90)% for highly similar nucleotide sequences (> 80% similarity), and (54–70)% for sequences having (30–80)% similarity. For other sequences, it terminates with an approximate score. For protein sequences, the average time gain is between (25–40)%. Compared to three heuristic global alignment methods, the quality of alignment is improved by about 23%–53%. FOGSAA is, in general, suitable for aligning any two sequences defined over a finite alphabet set, where the quality of the global alignment is of supreme importance.

conLSH: Context based Locality Sensitive Hashing for Mapping of noisy SMRT Reads

2019

Preprint

Single Molecule Real-Time (SMRT) sequencing is a recent advancement of Next Gen technology developed by Pacific Bio (PacBio). It comes with an explosion of long and noisy reads demanding cutting edge research to get most out of it. To deal with the high error probability of SMRT data, a novel contextual Locality Sensitive Hashing (conLSH) based algorithm is proposed in this article, which can effectively align the noisy SMRT reads to the reference genome. Here, sequences are hashed together based not only on their closeness, but also on similarity of context. The algorithm has O(n ρ+1 ) space requirement, where n is the number of sequences in the corpus and ρ is a constant. The indexing time and querying time are bounded by O( n ρ+1 ·ln n ln 1 P 2 ) and O(n ρ ) respectively, where P 2 > 0, is a probability value. This algorithm is particularly useful for retrieving similar sequences, a widely used task in biology. The proposed conLSH based aligner is compared with rHAT, popularly used for aligning SMRT reads, and is found to comprehensively beat it in speed as well as in memory requirements. In particular, it takes approximately 24.2% less processing time, while saving about 70.3% in peak memory requirement for H.sapiens PacBio dataset. 3/11 11/11

S-conLSH: Alignment-free gapped mapping of noisy long reads

Morgenstern

2019

Preprint

Motivation:The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. Results:We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. 1 S-conLSH is at least 2 times faster than the state-of-the-art alignment-based methods. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing. Single molecule real time (SMRT) sequencing developed by Pacific Biosciences [27] and Oxford nanopore technologies [21] have started to replace previous short length next generation sequencing (NGS) technologies. These new technologies have enabled us to address many unsolved problems regarding genetic variations. With the increase in read length to around 20KB [2], SMRT reads can be used to resolve ambiguities in read mapping caused by repetitive regions. Low GC bias and the ability to detect DNA methylation [27] from native DNA made SMRT data appealing for many real life applications. However, the high sequencing error rate of 13-15% per base [2] poses a real challenge in sequence analysis. Specialized methods like BWA-MEM [15], BLASR [6], rHAT [20], Minimap2 [17], lordFAST [9], etc., have been designed to align noisy long reads back to the respective reference genomes. BLASR [6] clusters the matched words from the reads and genome after indexing using suffix arrays or BWT-FM [28]. It uses a probabilitybased error optimization technique to find the alignment. BWA-MEM [15], originally designed for short read mapping, has been extended for PacBio and Oxford nanopore reads (with option -x pacbio and -x ont2d respectively) by efficient seeding and chaining of short exact matches.However, both methods are too slow to achieve a desired level of sensitivity [20]. This issue was addressed by rHAT [20] using a regional hash table where windows from the reference genome with the highest k-mer matches are chosen as candidate sites for further extension using a direct acyclic graph. Unfortunately, this method has a large memory footprint if used with the default word length of k = 13, and it fails to accom...

conLSH: Context based Locality Sensitive Hashing for mapping of noisy SMRT reads

Computational Biology and Chemistry

2020

Single Molecule Real-Time (SMRT) sequencing is a recent advancement of Next Gen technology developed by Pacific Bio (PacBio). It comes with an explosion of long and noisy reads demanding cutting edge research to get most out of it. To deal with the high error probability of SMRT data, a novel contextual Locality Sensitive Hashing (conLSH) based algorithm is proposed in this article, which can effectively align the noisy SMRT reads to the reference genome. Here, sequences are hashed together based not only on their closeness, but also on similarity of context. The algorithm has O(n ρ+1 ) space requirement, where n is the number of sequences in the corpus and ρ is a constant. The indexing time and querying time are bounded by O( n ρ+1 •ln nand O(n ρ ) respectively, where P 2 > 0, is a probability value. This algorithm is particularly useful for retrieving similar sequences, a widely used task in biology. The proposed conLSH based aligner is compared with rHAT, popularly used for aligning SMRT reads, and is found to comprehensively beat it in speed as well as in memory requirements. In particular, it takes approximately 24.2% less processing time, while saving about 70.3% in peak memory requirement for H.sapiens PacBio dataset..

Clustering of web sessions by FOGSAA

2013