Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2

Nip, Ka Ming; Hafezqorani, Saber; Gagalova, Kristina K.; Chiu, Readman; Yang, Chen; Warren, René L.; Birol, İnanç

doi:10.1038/s41467-023-38553-y

Cited by 15 publications

(10 citation statements)

References 52 publications

“…For example, Minstrobes have been used for long-read overlap detection ( Firtina et al 2023 ) and alternating strobe lengths have also been explored ( Maier and Sahlin 2023 ). However, randstrobes were shown to be more sensitive for sequence matching than other methods using fixed strobe lengths (minstrobes and hybridstrobes) ( Sahlin 2021a ), and simpler to construct than alternating strobe lengths (altstrobes and multistrobes) ( Maier and Sahlin 2023 ), and is so far most commonly implemented in practice ( Sahlin 2022 , Nip et al 2023 , Xu et al 2023 ). Therefore, we will consider only the randstrobes method in this study.…”

Section: Methods (mentioning)

confidence: 99%

Designing efficient randstrobes for sequence similarity analyses

Karami, Soltani Mohammadi, Martin et al. 2024, Bioinformatics

Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy.

Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign’s accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

Availability and implementation: All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.
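The strobemer idea summarized in this abstract can be illustrated with a minimal Python sketch of order-2 randstrobes. It is an illustration under stated assumptions, not the construction from the paper or from strobealign: the function names `kmer_hash` and `randstrobes`, the CRC32 hash, the XOR link operator, and the window parameters `w_min` and `w_max` are all hypothetical choices made for this example.

```python
import zlib


def kmer_hash(kmer: str) -> int:
    """Deterministic hash of a k-mer (CRC32 is an arbitrary choice for this sketch)."""
    return zlib.crc32(kmer.encode("ascii"))


def randstrobes(seq: str, k: int, w_min: int, w_max: int):
    """Return (i, j, seed) tuples for order-2 randstrobes.

    The first strobe is the k-mer at position i. The second strobe is the
    k-mer in the window [i + w_min, i + w_max] whose hash, XORed with the
    hash of the first strobe, is smallest. XOR is only one possible link
    operator; it is assumed here for simplicity.
    """
    seeds = []
    last_start = len(seq) - k  # last valid k-mer start position
    for i in range(len(seq) - k + 1):
        lo = i + w_min
        hi = min(i + w_max, last_start)
        if lo > hi:  # no room left for a second strobe
            break
        h1 = kmer_hash(seq[i:i + k])
        j = min(range(lo, hi + 1),
                key=lambda p: h1 ^ kmer_hash(seq[p:p + k]))
        seeds.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return seeds


if __name__ == "__main__":
    for i, j, seed in randstrobes("ACGTTGCATGCCGATAGCTAGGCTAACGTA", 3, 3, 6):
        print(i, j, seed)
```

Because the second strobe is chosen by hash rather than by fixed offset, two sequences that differ by a small insertion or deletion between the strobes can still share a seed, which is the property the abstract attributes to strobemers.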

Section: Methods (mentioning)

confidence: 99%

“…Randstrobes have been used, e.g. for short-read mapping ( Sahlin 2022 ), transcriptomic long-read normalization ( Nip et al 2023 ), and read classification ( Xu et al 2023 ). Our recent study also demonstrates that randstrobes provide accurate sequence similarity ranking using the Jaccard distance ( Maier and Sahlin 2023 ).…”

Section: Introduction (mentioning)

confidence: 99%

Designing efficient randstrobes for sequence similarity analyses

Karami, Soltani Mohammadi, Martin et al. 2024, Bioinformatics

“…As recommended for RNA-Bloom2 [ 68 ], Pychopper-processed reads were used in the default mode (java -jar RNA-Bloom.jar -long processed.fq -outdir RNA-bloom_out/). We also tested short read-based correction of RNA-Bloom2 assembly (java -jar RNA-Bloom.jar -long processed.fq -sef short-read.fastq -outdir RNA-bloom_out/).…”

Section: Comparison Of Transcript Assembly Programs (mentioning)

confidence: 99%
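The two RNA-Bloom2 invocations quoted in this citation statement can be assembled programmatically, as in the following minimal Python sketch. The flags (`-long`, `-sef`, `-outdir`) and file names are taken verbatim from the quoted commands; the helper name `rnabloom_command` and its parameters are hypothetical and are not part of RNA-Bloom2. The sketch only builds and prints the command lines and does not run the assembler.

```python
import shlex


def rnabloom_command(long_reads, outdir, short_reads=None, jar="RNA-Bloom.jar"):
    """Build the argument list for the RNA-Bloom2 call quoted in the statement.

    Hypothetical helper: only the flags seen in the quoted commands are used.
    """
    cmd = ["java", "-jar", jar, "-long", long_reads]
    if short_reads is not None:
        # Short-read input, as in the second quoted command
        cmd += ["-sef", short_reads]
    cmd += ["-outdir", outdir]
    return cmd


# Default mode on Pychopper-processed long reads
default_cmd = rnabloom_command("processed.fq", "RNA-bloom_out/")

# Same run with the additional short-read file
hybrid_cmd = rnabloom_command("processed.fq", "RNA-bloom_out/",
                              short_reads="short-read.fastq")

print(shlex.join(default_cmd))
print(shlex.join(hybrid_cmd))
```

To execute a command rather than print it, the list could be passed to `subprocess.run(cmd, check=True)`, assuming Java and the RNA-Bloom jar are available locally.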

Merging short and stranded long reads improves transcript assembly

Kainth, Haddad, Hall et al. 2023, PLoS Comput Biol

Long-read RNA sequencing has arisen as a counterpart to short-read sequencing, with the potential to capture full-length isoforms, albeit at the cost of lower depth. Yet this potential is not fully realized due to inherent limitations of current long-read assembly methods and underdeveloped approaches to integrate short-read data. Here, we critically compare the existing methods and develop a new integrative approach to characterize a particularly challenging pool of low-abundance long noncoding RNA (lncRNA) transcripts from short- and long-read sequencing in two distinct cell lines. Our analysis reveals severe limitations in each of the sequencing platforms. For short-read assemblies, coverage declines at transcript termini resulting in ambiguous ends, and uneven low coverage results in segmentation of a single transcript into multiple transcripts. Conversely, long-read sequencing libraries lack depth and strand-of-origin information in cDNA-based methods, culminating in erroneous assembly and quantitation of transcripts. We also discover a cDNA synthesis artifact in long-read datasets that markedly impacts the identity and quantitation of assembled transcripts. Towards remediating these problems, we develop a computational pipeline to “strand” long-read cDNA libraries that rectifies inaccurate mapping and assembly of long-read transcripts. Leveraging the strengths of each platform and our computational stranding, we also present and benchmark a hybrid assembly approach that drastically increases the sensitivity and accuracy of full-length transcript assembly on the correct strand and improves detection of biological features of the transcriptome. When applied to a challenging set of under-annotated and cell-type variable lncRNA, our method resolves the segmentation problem of short-read sequencing and the depth problem of long-read sequencing, resulting in the assembly of coherent transcripts with precise 5’ and 3’ ends. 
Our workflow can be applied to existing datasets for superior demarcation of transcript ends and refined isoform structure, which can enable better differential gene expression analyses and molecular manipulations of transcripts.

“…RNA-Bloom (Nip et al 2020) is a Java based assembly algorithm originally designed for single-cell short-read transcriptome assembly combining bloom filters with De Bruijn graph assembly. This approach to ab initio transcriptome assembly was improved upon and applied to long-read bulk sequencing data with RNA-Bloom2 (Nip et al 2023). The latter uses digital normalization with strobemers – a strategy reported to be less sensitive to mutations than k-mers – before assembling and polishing unitigs (Sahlin 2021).…”

Section: Introduction (mentioning)

confidence: 99%
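The pairing of Bloom filters with de Bruijn graphs mentioned in this statement can be sketched generically in Python. This is not RNA-Bloom's implementation: the class `BloomFilter`, the function `successors`, the SHA-256 based hashing, and the filter size are assumptions chosen for a short, self-contained example. The point it shows is that the graph is never stored explicitly; a k-mer's neighbours are found by querying the filter for each of its four possible one-base extensions.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item (generic sketch)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the item with an index
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("ascii")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return false positives, never false negatives
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def successors(kmer: str, bf: BloomFilter):
    """Outgoing neighbours of kmer in the implicit de Bruijn graph."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in bf]


if __name__ == "__main__":
    k = 4
    read = "ACGTTGCA"
    bf = BloomFilter()
    for i in range(len(read) - k + 1):
        bf.add(read[i:i + k])
    print(successors("ACGT", bf))
```

Because a Bloom filter can report false positives, a traversal built this way may occasionally follow an edge that was never observed; practical assemblers need additional checks to handle that.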

Assembly Arena: Benchmarking RNA isoform reconstruction algorithms for nanopore sequencing

Sagniez, Budhraja, Paré et al. 2024, Preprint

Resolving the transcriptomes of higher eukaryotes is more tangible with the advent of long read sequencing, which greatly facilitates the identification of new transcripts and their splicing isoforms. However, the computational analysis of long read RNA sequencing data remains challenging as it is difficult to disentangle technical artifacts from bona fide biological information. To address this, we evaluated the performance of multiple leading transcriptome assembly algorithms on their ability to accurately reconstruct RNA transcript isoforms. We specifically focused on deep nanopore sequencing of synthetic RNA spike-in controls (Sequins™ and SIRVs) across different chemistries, including cDNA and direct RNA protocols. Our systematic comparative benchmarking exposes the strengths and limitations of the different surveyed strategies. We also highlight conceptual and technical challenges with the annotation of transcriptomes and the formalization of assembly quality metrics. Our results complement similar recent endeavors, helping forge a path towards a gold standard analytical pipeline for long read transcriptome assembly.
