An elegant algorithm for the construction of suffix arrays

Rajasekaran, Sanguthevar; Nicolae, Marius

doi:10.1016/j.jda.2014.03.001

Cited by 14 publications

(9 citation statements)

References 31 publications


“…All convolutions were performed using the fftw [28] library Version 3.3.3. We used the suffix array algorithm RadixSA of [29]. Figure 1 shows run times for varying the length of the text n. All algorithms scale linearly with the length of the text.…”

Section: Results (mentioning)

confidence: 99%

On String Matching with Mismatches

Nicolae

Rajasekaran

2015

Algorithms

Self Cite


Abstract: In this paper, we consider several variants of the pattern matching with mismatches problem. In particular, given a text $T = t_1 t_2 \cdots t_n$ and a pattern $P = p_1 p_2 \cdots p_m$, we investigate the following problems: (1) pattern matching with mismatches: for every $i$, $1 \le i \le n - m + 1$, output the distance between $P$ and $t_i t_{i+1} \cdots t_{i+m-1}$; and (2) pattern matching with $k$ mismatches: output those positions $i$ where the distance between $P$ and $t_i t_{i+1} \cdots t_{i+m-1}$ is less than a given threshold $k$. The distance metric used is the Hamming distance. We present some novel algorithms and techniques for solving these problems. We offer deterministic, randomized and approximation algorithms. We consider variants of these problems where there could be wild cards in either the text or the pattern or both. We also present an experimental evaluation of these algorithms. The source code is available at
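
The two problems stated in the abstract can be made concrete with a brute-force sketch. This is only a baseline for illustration, not any of the paper's algorithms (the abstract does not describe them). The function names are hypothetical, positions are 0-based rather than the 1-based indexing used in the abstract, and the running time is O(nm).

```python
def mismatch_counts(text: str, pattern: str) -> list[int]:
    """Hamming distance between the pattern and every length-m window of text.

    Brute-force baseline, O(n * m); names are illustrative assumptions.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    return [
        sum(1 for a, b in zip(text[i:i + m], pattern) if a != b)
        for i in range(n - m + 1)
    ]


def k_mismatch_positions(text: str, pattern: str, k: int) -> list[int]:
    """0-based positions whose Hamming distance is less than the threshold k."""
    return [i for i, d in enumerate(mismatch_counts(text, pattern)) if d < k]


if __name__ == "__main__":
    print(mismatch_counts("abcabd", "abd"))
    print(k_mismatch_positions("abcabd", "abd", 2))
```

The strict comparison `d < k` follows the abstract's wording ("less than a given threshold k"); other formulations of the problem use "at most k" instead.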


“…We implemented k-mer counting using a generalized suffix array and the derived longest common prefix (LCP) array. The generalized suffix array SA is created from the concatenated reads (delimited by special characters such as $) using a linear algorithm [33]. Then, we create the LCP using both the suffix array SA and the reversed suffix array SA′ [34, 35].…”

Section: Methods (mentioning)

confidence: 99%

“…For each position i in the LCP, LCP[i] contains the size of the longest common prefix between SA[i] and SA[i−1]. The key observation [33] for efficient computation of LCP[i] is: for a position j in T, if LCP[SA′[j−1]] is L, LCP[SA′[j]] ≥ L−1. The whole LCP array construction takes linear time to the size of T [33].…”

Section: Methods (mentioning)

confidence: 99%
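
The key observation quoted above, LCP[SA′[j]] ≥ L−1, is what lets the LCP array be filled in linear time: when moving from text position j−1 to j, the comparison can resume from L−1 matched characters instead of restarting at zero. The sketch below illustrates this under two assumptions: that SA′ denotes the inverse (rank) array of SA, and that a naive sort stands in for the linear-time suffix array construction the citing paper uses. Function names are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array by sorting suffixes (a stand-in, not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """lcp[i] = longest common prefix of the suffixes at sa[i] and sa[i - 1].

    lcp[0] is 0 by convention. Runs in O(n) given sa.
    """
    n = len(text)
    rank = [0] * n  # inverse of sa: rank[j] is the position of suffix j in sa
    for pos, start in enumerate(sa):
        rank[start] = pos

    lcp = [0] * n
    h = 0  # number of characters already known to match
    for j in range(n):  # visit suffixes in text order, not sorted order
        if rank[j] == 0:
            h = 0  # no predecessor in sorted order
            continue
        prev = sa[rank[j] - 1]
        while j + h < n and prev + h < n and text[j + h] == text[prev + h]:
            h += 1
        lcp[rank[j]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(sa, lcp_array("banana", sa))
```

Because `h` decreases by at most one per step and never exceeds n, the inner loop performs O(n) character comparisons in total.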

Improving the sensitivity of long read overlap detection using grouped short k-mer matches

Chen

Sun

2019

BMC Genomics


Background: Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize the intra-species variations. It also holds the promise to decipher the community structure in complex microbial communities because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads forming overlaps. Because PacBio data has higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly for metagenomic data produced by third-generation sequencing technologies. Results: In this work, we designed and implemented an overlap detection program named GroupK, for third-generation sequencing reads based on grouped k-mer hits. While using k-mer hits for detecting reads’ overlaps has been adopted by several existing programs, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hit was originally designed for homology search. We are the first to apply group hit for long read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets of low sequencing coverage. Conclusions: GroupK is best used for detecting small overlaps for third-generation sequencing data. It provides a useful supplementary tool to existing ones for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.


“…This data structure requires ${\sim}10$ bytes/nucleotide. This array is constructed by a new, fast sorting method that is highly scalable (Rajasekaran and Nicolae 2014), having worst case run times of $O(L \log L)$ and usually much better than this in practice. Once the suffix array is sorted, exact $k$-mer matches form contiguous blocks in the array.…”

Section: Methods (mentioning)

confidence: 99%
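
The remark that exact k-mer matches form contiguous blocks in a sorted suffix array can be shown with a small sketch. It assumes a naively sorted suffix array rather than the construction the citing paper uses, and it materializes the k-character prefixes for clarity, which a memory-conscious implementation would avoid. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def suffix_array(text: str) -> list[int]:
    """Naive suffix array by sorting suffixes (illustrative stand-in)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def kmer_block(text: str, sa: list[int], kmer: str) -> tuple[int, int]:
    """Half-open range [lo, hi) of suffix array entries that start with kmer.

    Truncating sorted suffixes to k characters keeps them in sorted order,
    so every occurrence of kmer sits in one contiguous run.
    """
    k = len(kmer)
    prefixes = [text[i:i + k] for i in sa]  # O(n * k) memory, for clarity only
    return bisect_left(prefixes, kmer), bisect_right(prefixes, kmer)


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    lo, hi = kmer_block(text, sa, "an")
    print(lo, hi, sa[lo:hi])
```

The text positions of the matches are `sa[lo:hi]`, and `hi - lo` is the number of occurrences; an empty range means the k-mer does not occur.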

“…To keep the speed and simplicity of $k$-mer based approaches but retain information about positional homology, we combined and extended several well-tested ideas in new ways (Gardner and Hall 2013; Leimeister and Morgenstern 2014; Fan et al 2015; Haubold et al 2015) and leveraged recent improvements in engineering of a key data structure (Rajasekaran and Nicolae 2014). From a set of $N$ genomes, which may be at various stages of assembly, our algorithm builds short multiple sequence alignments, or “$k$-mer blocks,” starting from approximately matching $k$-mer “seeds” (Fig.…”

mentioning

confidence: 99%

Homology-aware Phylogenomics at Gigabase Scales

2017

Self Cite


Obstacles to inferring species trees from whole genome data sets range from algorithmic and data management challenges to the wholesale discordance in evolutionary history found in different parts of a genome. Recent work that builds trees directly from genomes by parsing them into sets of small $k$-mer strings holds promise to streamline and simplify these efforts, but existing approaches do not account well for gene tree discordance. We describe a “seed and extend” protocol that finds nearly exact matching sets of orthologous $k$-mers and extends them to construct data sets that can properly account for genomic heterogeneity. Exploiting an efficient suffix array data structure, sets of whole genomes can be parsed and converted into phylogenetic data matrices rapidly, with contiguous blocks of $k$-mers from the same chromosome, gene, or scaffold concatenated as needed. Phylogenetic trees constructed from highly curated rice genome data and a diverse set of six other eukaryotic whole genome, transcriptome, and organellar genome data sets recovered trees nearly identical to published phylogenomic analyses, in a small fraction of the time, and requiring many fewer parameter choices. Our method’s ability to retain local homology information was demonstrated by using it to characterize gene tree discordance across the rice genome, and by its robustness to the high rate of interchromosomal gene transfer found in several rice species.

