A fast adaptive algorithm for computing whole-genome homology maps

Jain, Chirag; Koren, Sergey; Dilthey, Alexander; Phillippy, Adam M.; Aluru, Srinivas

doi:10.1093/bioinformatics/bty597

Cited by 140 publications

(137 citation statements)

References 39 publications

Mentioning: 137

“…HLL lacks another advantage of MinHash; when MinHash is used in conjunction with a reversible hash function, it can be used not only to calculate the relevant set cardinalities but also to report the k-mers common between the sets. This can provide crucial hints when the eventual goal is to map a read to (or near) its point of origin with respect to the reference, as is the goal for tools like MashMap [5].…”

Section: Discussion (mentioning)

confidence: 99%
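
The point about reversible hashing in the statement above can be illustrated with a minimal Python sketch. It is not taken from MashMap or Dashing: the multiplicative hash, the constant `A`, and every function name are assumptions chosen for illustration, and reverse-complement (canonical) k-mers are ignored for brevity. Multiplying by an odd constant modulo 2^64 is a bijection, so each hash value in a bottom-s sketch can be mapped back to the k-mer that produced it.

```python
MASK = (1 << 64) - 1
A = 0x9E3779B97F4A7C15           # assumed odd constant; odd => invertible mod 2**64
A_INV = pow(A, -1, 1 << 64)      # modular inverse (Python 3.8+)

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
DEC = "ACGT"


def encode(kmer):
    """Pack a k-mer (k <= 32) into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENC[base]
    return value


def decode(value, k):
    """Unpack a 2-bit-encoded integer back into a k-mer string."""
    return "".join(DEC[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def rhash(x):
    """Reversible hash: multiplication by an odd constant mod 2**64."""
    return (x * A) & MASK


def unhash(h):
    """Inverse of rhash."""
    return (h * A_INV) & MASK


def sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hashes."""
    hashes = {rhash(encode(seq[i:i + k])) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def compare(sketch_a, sketch_b, k, s):
    """Estimate Jaccard similarity and recover the shared k-mers."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    union = sorted(set_a | set_b)[:s]          # bottom-s of the merged sketches
    if not union:
        return 0.0, []
    shared = [h for h in union if h in set_a and h in set_b]
    return len(shared) / len(union), [decode(unhash(h), k) for h in shared]
```

With a non-invertible hash the same comparison would still yield the Jaccard estimate, but the list of shared k-mers could not be reconstructed from the sketches alone.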

“…Since the release of the seminal Mash tool [1], data sketches such as MinHash have become instrumental in comparative genomics. They are used to cluster genomes from large databases [1], search for datasets with certain sequence content [2], accelerate the overlapping step in genome assemblers [3,4], map sequencing reads [5], and find similarity thresholds characterizing species-level distinctions [6]. Whereas MinHash was originally developed to find similar web pages [7], here it is being used to summarize large genomic sequence collections such as reference genomes or sequencing datasets.…”

Section: Introduction (mentioning)

confidence: 99%

Dashing: fast and accurate genomic distances with HyperLogLog

Baker, Langmead. Genome Biol, 2019.

Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.
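
As a rough illustration of the idea in this abstract, the following Python sketch builds a HyperLogLog, merges two sketches by taking register-wise maxima, and derives a Jaccard estimate by inclusion-exclusion. It is not Dashing's implementation: the abstract says Dashing uses specialized estimators for unions and intersections, whereas this sketch uses the original HyperLogLog estimator with the small-range correction. The class name `HLL`, the choice of `blake2b` as the hash, and the default `p=12` are all assumptions.

```python
import hashlib
import math


class HLL:
    """Minimal HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.reg = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")            # 64-bit hash value
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # rank = number of leading zeros in the remaining bits, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def union(self, other):
        """Sketch of the set union: register-wise maximum."""
        out = HLL(self.p)
        out.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]
        return out

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, valid for m >= 128
        est = alpha * m * m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if est <= 2.5 * m and zeros:                 # small-range (linear counting) fix
            est = m * math.log(m / zeros)
        return est


def jaccard(a, b):
    """Jaccard estimate via inclusion-exclusion on three cardinalities."""
    union_size = a.union(b).cardinality()
    inter = max(0.0, a.cardinality() + b.cardinality() - union_size)
    return inter / union_size if union_size else 0.0
```

Inclusion-exclusion compounds the error of three separate estimates, which is one reason a tool might prefer estimators designed specifically for unions and intersections.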

“…To obtain similar sequences within a reference, we mapped the spliced transcript sequences against a version of the genome where all exon segments were hard-masked (i.e., replaced with N). We performed this mapping using MashMap [20], with a segment size 500 and minimum percent identity of 80%. The sequence similar regions were merged (per-chromosome) using BedTools [45] and concatenated, giving a decoy sequence for each chromosome.…”

Section: Decoy Sequences (mentioning)

confidence: 99%
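
The decoy procedure quoted above (hard-mask exons, map, merge hits per chromosome, concatenate) can be sketched in Python as follows. This is a simplified stand-in, not the authors' pipeline: the mapping step itself is omitted, the function names are hypothetical, intervals are assumed to be 0-based and half-open, and the merge joins overlapping or touching intervals, which is how the default behaviour of `bedtools merge` is understood here.

```python
from collections import defaultdict


def hard_mask(seq, intervals):
    """Replace every base inside the given (start, end) intervals with N."""
    chars = list(seq)
    for start, end in intervals:
        end = min(end, len(chars))               # clip to the sequence length
        for i in range(max(start, 0), end):
            chars[i] = "N"
    return "".join(chars)


def merge_per_chrom(hits):
    """Merge overlapping or touching (chrom, start, end) hits per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in hits:
        by_chrom[chrom].append((start, end))

    merged = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        out = [list(ivs[0])]
        for start, end in ivs[1:]:
            if start <= out[-1][1]:              # overlaps or abuts the previous one
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(iv) for iv in out]
    return merged


def decoy_sequence(chrom_seq, intervals):
    """Concatenate the merged regions of one chromosome into a decoy."""
    return "".join(chrom_seq[start:end] for start, end in intervals)
```

In this sketch the mapper's output would be converted to `(chrom, start, end)` tuples before calling `merge_per_chrom`, and `decoy_sequence` would then be called once per chromosome.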

“…We also attempt to address one of the failure modes of direct alignment against the transcriptome, compared to spliced alignment to the genome: when a sequenced fragment originates from an unannotated genomic locus bearing sequence similarity to an annotated transcript, it can be falsely mapped to the annotated transcript since the relevant genomic sequence is not available to the method. We describe a procedure that makes use of MashMap [20] to identify and extract such sequence similar decoy regions from the genome. The normal Salmon index is then augmented with these decoy sequences, which are handled in a special manner during mapping and alignment scoring, leading to a reduction in such cases of false mappings.…”

Section: Introduction (mentioning)

confidence: 99%

Alignment and mapping methodology influence transcript abundance estimation

et al. 2020

Background: The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results: We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion: We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.

“…[77]. Whole-genome alignment was computed by MashMap (https://github.com/marbl/MashMap) employing default settings, and was visualized as a dot plot [78].…”

Section: Genome Assembly Analysis Of Genomic Features and Synteny Co (mentioning)

confidence: 99%

Long transposon-rich centromeres in an oomycete reveal divergence of centromere features in Stramenopila-Alveolata-Rhizaria lineages

et al. 2020

Centromeres are chromosomal regions that serve as platforms for kinetochore assembly and spindle attachments, ensuring accurate chromosome segregation during cell division. Despite functional conservation, centromere DNA sequences are diverse and often repetitive, making them challenging to assemble and identify. Here, we describe centromeres in an oomycete Phytophthora sojae by combining long-read sequencing-based genome assembly and chromatin immunoprecipitation for the centromeric histone CENP-A followed by high-throughput sequencing (ChIP-seq). P. sojae centromeres cluster at a single focus at different life stages and during nuclear division. We report an improved genome assembly of the P. sojae reference strain, which enabled identification of 15 enriched CENP-A binding regions as putative centromeres. By focusing on a subset of these regions, we demonstrate that centromeres in P. sojae are regional, spanning 211 to 356 kb. Most of these regions are transposon-rich, poorly transcribed, and lack the histone modification H3K4me2 but are embedded within regions with the heterochromatin marks H3K9me3 and H3K27me3. Strikingly, we discovered a Copia-like transposon (CoLT) that is highly enriched in the CENP-A chromatin. Similar clustered elements are also found in oomycete relatives of P. sojae, and may be applied as a criterion for prediction of oomycete centromeres. This work reveals a divergence of centromere features in oomycetes as compared to other organisms in the Stramenopila-Alveolata-Rhizaria (SAR) supergroup including diatoms and Plasmodium falciparum that have relatively short and simple regional centromeres. Identification of P. sojae centromeres in turn also advances the genome assembly.
