SparkLeBLAST: Scalable Parallelization of BLAST Sequence Alignment Using Spark

Youssef, Karim; Feng, Wu-chun

doi:10.1109/ccgrid49817.2020.00-39

Cited by 3 publications

(3 citation statements)

References 21 publications


“…BLAST implements a highly optimized memory management layer based on memory-mapped I/O to read the sequence database. However, recent studies, including [21], have shown that paging significantly degrades BLAST's performance when the database does not fit in memory. While distributing the sequence database across multiple nodes [21], [22] circumvents paging, it introduces high network overhead for processing significantly large output.…”

Section: Discussion (mentioning)

confidence: 99%

“…However, recent studies, including [21], have shown that paging significantly degrades BLAST's performance when the database does not fit in memory. While distributing the sequence database across multiple nodes [21], [22] circumvents paging, it introduces high network overhead for processing significantly large output. Alternatively, we explore leveraging UMap with optimized parameters to mitigate paging overhead.…”

Section: Discussion (mentioning)

confidence: 99%


AutoPager: Auto-tuning Memory-Mapped I/O Parameters in Userspace

Youssef, Shah, Gokhale, et al. 2022

2022 IEEE High Performance Extreme Computing Conference (HPEC)


The exponential growth in dataset sizes has shifted the bottleneck of high-performance data analytics from the compute subsystem to the memory and storage subsystems. This bottleneck has led to the proliferation of non-volatile memory (NVM). To bridge the performance gap between the Linux I/O subsystem and NVM, userspace memory-mapped I/O enables application-specific I/O optimizations. Specifically, UMap, an open-source userspace memory-mapping tool, exposes tunable paging parameters to application users, such as page size and degree of paging concurrency. Tuning these parameters is computationally intractable due to the vast search space and the cost of evaluating each parameter combination. To address this challenge, we present AUTOPAGER, a tool for autotuning userspace paging parameters. Our evaluation, using five data-intensive applications with UMap, shows that AUTOPAGER automatically achieves comparable performance to exhaustive tuning with 10× less tuning overhead and 16.3× and 1.52× speedup over UMap with default parameters and UMap with page-size only tuning, respectively.


“…Given expanding amount of data, providing fast and biologically valuable sequence alignment tools via high-performance computing (HPC) and algorithmic innovations has been a highly active area of bioinformatics research, particularly in the context of rapidly expanding databases. For example, several sequence alignment programs have relied on contributing algorithmic improvements (e.g., HMMER [4], DIAMOND [5], CaBLAST [6]) while others have focused on improving parallelization to take advantage of emerging high-performance computing (HPC) platforms and programming paradigms (e.g., cuBLASTP [7], muBLASTP [8], mpiBLAST [9], SparkBLAST [10], and SparkLeBLAST [11]). Both DIAMOND [5] and CaBLAST [6] improve the execution time of sequence alignment by compressing the sequence database.…”

Section: Introduction (mentioning)

confidence: 99%

iBLAST: Incremental BLAST of new sequences via automated e-value correction

et al. 2021

Self Cite


Search results from local alignment search tools use statistical scores that are sensitive to the size of the database to report the quality of the result. For example, NCBI BLAST reports the best matches using similarity scores and expect values (i.e., e-values) calculated against the database size. Given the astronomical growth in genomics data throughout a genomic research investigation, sequence databases grow as new sequences are continuously being added to these databases. As a consequence, the results (e.g., best hits) and associated statistics (e.g., e-values) for a specific set of queries may change over the course of a genomic investigation. Thus, to update the results of a previously conducted BLAST search to find the best matches on an updated database, scientists must currently rerun the BLAST search against the entire updated database, which translates into irrecoverable and, in turn, wasted execution time, money, and computational resources. To address this issue, we devise a novel and efficient method to redeem past BLAST searches by introducing iBLAST. iBLAST leverages previous BLAST search results to conduct the same query search but only on the incremental (i.e., newly added) part of the database, recomputes the associated critical statistics such as e-values, and combines these results to produce updated search results. Our experimental results and fidelity analyses show that iBLAST delivers search results that are identical to NCBI BLAST at a substantially reduced computational cost, i.e., iBLAST performs (1 + δ)/δ times faster than NCBI BLAST, where δ represents the fraction of database growth. We then present three different use cases to demonstrate that iBLAST can enable efficient biological discovery at a much faster speed with a substantially reduced computational cost.
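The speedup figure quoted in the abstract can be illustrated with a small sketch. It assumes that search cost is proportional to the number of database residues scanned, so a full rerun costs (1 + δ) units while an incremental search costs δ units. The function names are hypothetical, and the e-value rescaling shown is only the textbook linear dependence on database size, an assumption made for illustration and not iBLAST's actual correction procedure.

```python
def incremental_speedup(delta: float) -> float:
    """Ratio of full-rerun cost to incremental cost.

    delta is the fraction of database growth (new size = (1 + delta) * old size).
    Assumes cost scales linearly with the amount of database searched.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    return (1.0 + delta) / delta


def rescale_evalue(e_old: float, n_old: float, n_new: float) -> float:
    """Rescale an e-value from database size n_old to n_new.

    Illustrative assumption: the e-value is directly proportional to the
    database size, with no length adjustment or other correction.
    """
    if n_old <= 0 or n_new <= 0:
        raise ValueError("database sizes must be positive")
    return e_old * (n_new / n_old)


if __name__ == "__main__":
    # A database that grew by 25% gives a 5x speedup under this cost model.
    print(incremental_speedup(0.25))
    # An old hit's e-value grows in proportion to the larger database.
    print(rescale_evalue(1e-5, 1000.0, 1250.0))
```

Under this model the benefit is largest when the database grows slowly: a small δ makes the incremental portion cheap relative to a full rerun.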


DNA Genome Classification with Machine Learning and Image Descriptors

Cussi, Arceda 2023

Lecture Notes in Networks and Systems

