2019
DOI: 10.3390/genes10110886

PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead

Abstract: (1) Background: The DNA sequence alignment process is an essential step in genome analysis. BWA-MEM has been a prevalent single-node tool for genome alignment because of its high speed and accuracy. Genome data are being generated at an exponential rate, and handling such large volumes requires a multi-node solution, which remains a challenge. Spark is a ubiquitous big data platform that has been exploited to assist genome alignment in handling this challenge. Nonetheless, existing works that utilize Spark to optimize BWA-MEM …

Cited by 11 publications (9 citation statements)
References 26 publications
“…Almost all BWA-MEM cluster-scaled implementations (SparkBWA [8], BWASpark [9], PipeMEM [10], ADAM [7], and SparkGA2 [6]) run multiple BWA-MEM instances on each Spark worker node as Spark tasks, which degrades the underlying efficient single-node multi-threaded scalability of this tool. Instead we use 1 BWA-MEM instance on each Spark worker node, storing output SAM files on storage and merging these SAM files to generate a single output SAM file.…”
Section: Methods (mentioning)
confidence: 99%
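
The merge step described in the statement above (per-node SAM outputs combined into a single SAM file) can be sketched as follows. This is a minimal illustration, not the citing authors' code: the function name `merge_sam` is hypothetical, the inputs are in-memory lists of lines rather than files on cluster storage, and it assumes every per-node output carries the same header, so only the first copy is kept.

```python
from typing import Iterable, List


def merge_sam(parts: Iterable[List[str]]) -> List[str]:
    """Merge per-node SAM outputs into one list of SAM lines.

    Assumption: all parts share an identical header. Header lines
    (those starting with '@') are kept from the first part only;
    alignment records are concatenated from every part, in order.
    """
    merged: List[str] = []
    for index, lines in enumerate(parts):
        for line in lines:
            # SAM header lines begin with '@'; read names cannot.
            if line.startswith("@") and index > 0:
                continue  # drop repeated headers from later parts
            merged.append(line)
    return merged


if __name__ == "__main__":
    header = ["@HD\tVN:1.6", "@SQ\tSN:chr1\tLN:1000"]
    node_a = header + ["read1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII"]
    node_b = header + ["read2\t0\tchr1\t5\t60\t4M\t*\t0\t0\tTTGA\tIIII"]
    for sam_line in merge_sam([node_a, node_b]):
        print(sam_line)
```

In practice the per-node headers may differ (for example in `@PG` lines), in which case a dedicated tool for merging SAM/BAM files is a safer choice than plain concatenation.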
“…pBWA [30] and mpiBLAST [31] use MPI, and CUSHAW3 [32] uses UPC++. Similarly ADAM’s Cannoli [7], SparkBWA [8], and PipeMEM [10] are a few Apache Spark–based BWA implementations that use BWA as loosely integrated underneath these implementations while GATK BWASpark modifies the original BWA to exploit the Spark scheduling and shuffling functionality to run BWA instances in parallel on clusters.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…Recently, Big Data technologies such as Apache Hadoop [4] and Apache Spark [5,6] are being employed. They allow the usage of high-level programming languages, such as Java, Python, or Scala, while providing ease of use and performance [7][8][9][10][11].…”
Section: Introduction (mentioning)
confidence: 99%
“…Big Data technologies, on the other hand, have become increasingly popular, and their usage is no longer restricted to data analytics, but has been successfully used in fields like bioinformatics [7][8][9][10][11]15], chemistry [29,30], or medicine [31,32]. Technologies like Apache Hadoop [4] or Apache Spark [5] offer a scalable way to process enormous amounts of data in large clusters of "cheap" computers or virtual machines in the cloud, using simple programming models.…”
Section: Introduction (mentioning)
confidence: 99%