rHAT: fast alignment of noisy long reads with regional hashing

Liu, Bo; Guan, Dengfeng; Teng, Mingxiang; Wang, Yadong

doi:10.1093/bioinformatics/btv662

Cited by 37 publications

(44 citation statements)

References 24 publications


“…If B gets more votes, then it is more likely to be clone code of A. The idea is similar to the idea of seed-and-extend in the sequencing alignment [8]. The threshold for SR A (B) is θ (line 17).…”

Section: Filtering Via the Common Seeds Number

mentioning

confidence: 99%

LVMapper: A Large-Variance Clone Detector Using Sequencing Alignment Approach

Wang

Yin

et al. 2020

IEEE Access


To detect large-variance code clones (i.e. clones with relatively more differences) in large-scale code repositories is difficult because most current tools can only detect almost identical or very similar clones. It will make promotion and changes to some software applications such as bug detection, code completion, software analysis, etc. Recently, CCAligner made an attempt to detect clones with relatively concentrated modifications called large-gap clones. Our contribution is to develop a novel and effective detection approach of large-variance clones to more general cases for not only the concentrated code modifications but also the scattered code modifications. A detector named LVMapper is proposed, borrowing and changing the approach of sequencing alignment in bioinformatics which can find two similar sequences with more differences. The ability of LVMapper was tested on both self-synthetic datasets and real cases, and the results show substantial improvement in detecting large-variance clones compared with other state-of-the-art tools including CCAligner. Furthermore, our new tool also presents good recall and precision for general Type-1, Type-2 and Type-3 clones on the widely used benchmarking dataset, BigCloneBench.


“…conLSH [8]. The aligner, rHAT [20] has been excluded from the study, as it has been reported to malfunction in certain scenarios [17]. The PacBio read alignment module of BWA-MEM [15]…”

Section: Mapper Command Line Settings

mentioning

confidence: 99%

“…However, the high sequencing error rate of 13-15% per base [2] poses a real challenge in sequence analysis. Specialized methods like BWA-MEM [15], BLASR [6], rHAT [20], Minimap2 [17], lordFAST [9], etc., have been designed to align noisy long reads back to the respective reference genomes. BLASR [6] clusters the matched words from the reads and genome after indexing using suffix arrays or BWT-FM [28].…”

Section: Introduction

mentioning

confidence: 99%

“…However, both methods are too slow to achieve a desired level of sensitivity [20]. This issue was addressed by rHAT [20] using a regional hash table where windows from the reference genome with the highest k-mer matches are chosen as candidate sites for further extension using a direct acyclic graph. Unfortunately, this method has a large memory footprint if used with the default word length of k = 13, and it fails to accommodate longer k-mers to resolve repeats.…”

Section: Introduction

mentioning

confidence: 99%


S-conLSH: Alignment-free gapped mapping of noisy long reads

Chakraborty

Morgenstern

Bandyopadhyay

2019

Preprint


Motivation: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. Results: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the state-of-the-art alignment-based methods. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing. Single molecule real time (SMRT) sequencing developed by Pacific Biosciences [27] and Oxford nanopore technologies [21] have started to replace previous short length next generation sequencing (NGS) technologies. These new technologies have enabled us to address many unsolved problems regarding genetic variations. With the increase in read length to around 20KB [2], SMRT reads can be used to resolve ambiguities in read mapping caused by repetitive regions. Low GC bias and the ability to detect DNA methylation [27] from native DNA made SMRT data appealing for many real life applications.
However, the high sequencing error rate of 13-15% per base [2] poses a real challenge in sequence analysis. Specialized methods like BWA-MEM [15], BLASR [6], rHAT [20], Minimap2 [17], lordFAST [9], etc., have been designed to align noisy long reads back to the respective reference genomes. BLASR [6] clusters the matched words from the reads and genome after indexing using suffix arrays or BWT-FM [28]. It uses a probability-based error optimization technique to find the alignment. BWA-MEM [15], originally designed for short read mapping, has been extended for PacBio and Oxford nanopore reads (with option -x pacbio and -x ont2d respectively) by efficient seeding and chaining of short exact matches. However, both methods are too slow to achieve a desired level of sensitivity [20]. This issue was addressed by rHAT [20] using a regional hash table where windows from the reference genome with the highest k-mer matches are chosen as candidate sites for further extension using a direct acyclic graph. Unfortunately, this method has a large memory footprint if used with the default word length of k = 13, and it fails to accom...
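
The citation statements above describe rHAT as choosing the reference windows with the most k-mer matches as candidate sites. A minimal Python sketch of that window-voting idea follows. It is an illustration only, not rHAT's implementation: the function names, the window and step sizes, and the toy sequences are all assumptions, and the later graph-based extension step is not shown.

```python
from collections import Counter, defaultdict


def build_window_index(reference, k, window, step):
    """Map each k-mer to the ids of the reference windows that contain it.

    Hypothetical helper: windows start every `step` bases and span
    `window` bases, so step < window gives overlapping windows.
    """
    index = defaultdict(set)
    last_start = max(len(reference) - k + 1, 1)
    for wid, start in enumerate(range(0, last_start, step)):
        chunk = reference[start:start + window]
        for i in range(len(chunk) - k + 1):
            index[chunk[i:i + k]].add(wid)
    return index


def candidate_windows(read, index, k, top_n=1):
    """Rank windows by the number of read k-mers they share (simple voting)."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        # .get avoids creating empty entries for k-mers absent from the index
        for wid in index.get(read[i:i + k], ()):
            votes[wid] += 1
    return votes.most_common(top_n)


# Toy example: three 20-base windows, with the read drawn from the middle one.
middle = "ACGTTGCATCGGATCCTAGC"
reference = "A" * 20 + middle + "T" * 20
index = build_window_index(reference, k=4, window=20, step=20)
read = middle[3:15]
best = candidate_windows(read, index, k=4)
```

With non-overlapping windows, as in this toy setup, k-mers that straddle a window boundary are not indexed; choosing a step smaller than the window size is one way to cover them, at the cost of a larger index.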


“…According to Chou's 5-step rule [17] and demonstrated in a series of recent publications, [18][19][20][21][22][23][24][25][26] to present a statistical predictor for a biological system with a clear logic and application value, we need to consider the following guidelines: (1) how to construct or select a valid benchmark dataset to train and test the predictor; (2) how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (3) how to introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) how to properly perform cross-validation tests to objectively evaluate its anticipated accuracy; (5) how to establish a user-friendly web-server that is accessible to the public. Below, let us describe the five procedures one-by-one.…”

mentioning

confidence: 99%

iPhos‐PseEvo: Identifying Human Phosphorylated Proteins by Incorporating Evolutionary Information into General PseAAC via Grey System Theory

Qiu

Sun

Xiao

et al. 2016

Molecular Informatics


Protein phosphorylation plays a critical role in human body by altering the structural conformation of a protein, causing it to become activated/deactivated, or functional modification. Given an uncharacterized protein sequence, can we predict whether it may be phosphorylated or may not? This is no doubt a very meaningful problem for both basic research and drug development. Unfortunately, to our best knowledge, so far no high throughput bioinformatics tool whatsoever has been developed to address such a very basic but important problem due to its extremely complexity and lacking sufficient training data. Here we proposed a predictor called iPhos-PseEvo by (1) incorporating the protein sequence evolutionary information into the general pseudo amino acid composition (PseAAC) via the grey system theory, (2) balancing out the skewed training datasets by the asymmetric bootstrap approach, and (3) constructing an ensemble predictor by fusing an array of individual random forest classifiers thru a voting system. Rigorous jackknife tests have indicated that very promising success rates have been achieved by iPhos-PseEvo even for such a difficult problem. A user-friendly web-server for iPhos-PseEvo has been established at http://www.jci-bioinfo.cn/iPhos-PseEvo, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. It has not escaped our notice that the formulation and approach presented here can be used to analyze many other problems in protein science as well.


rHAT: fast alignment of noisy long reads with regional hashing

Abstract: Supplementary data are available at Bioinformatics online.
