SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning

Vineetha, V.; Biji, C. L.; Nair, Achuthsankar S.

doi:10.1038/s41598-019-42966-5

Cited by 7 publications

(3 citation statements)

References 26 publications

(18 reference statements)

“…To deal with the computational complexity of this task, various heuristics have been proposed in the literature. SPARK-MSNA is an MSA algorithm on Spark proposed by Vineetha et al. (2019) [56]. The algorithm uses both a suffix tree and a modified Needleman-Wunsch algorithm.…”

Section: Apache Spark In Life Sciences (mentioning)

confidence: 99%

Framing Apache Spark in life sciences

Manconi, Gnocchi, Milanesi et al. 2023

Heliyon

“…SparkBAW aims to boost the process of the alignment phase in the DNA sequence analysis by targeting the short-read mapping. Other Spark-based multiple sequence alignment implementations are PASTASpark [26] and [27], the latter with a supervised learning approach. Also reported in the literature are in-memory data analytics applications that process columnar data, such as ArrowSAM [29], which employs Apache Arrow. In PipeMEM [33], a pipeline parallel pattern that ensures no local disk access, the authors optimized the computation phase by employing standard stream and PipeRDD.…”

Section: Related Work (mentioning)

confidence: 99%

SparkFlow: Towards High-Performance Data Analytics for Spark-based Genome Analysis

Filgueira, Awaysheh, Carter et al. 2022

2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

Recent advances in DNA sequencing technology have triggered next-generation sequencing (NGS) research at full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Utilizing cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over Singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-the-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow realizes the ability to accelerate large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (on-demand provisioning). Our results demonstrate an inevitable trade-off between the targeted applications and processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while maintaining deployment complexity. The paper's findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm.

“…In [26], existing algorithms such as Needleman-Wunsch, Smith-Waterman, and BLAST are employed along with Hadoop or other big data technologies to scale down the time, memory, and CPU consumption. Spark MSNA (Multiple Sequence Nucleotide Alignment) services are used to compare the suffix tree approach [27]. FASTdoop [28] is able to load the FASTA and FASTQ input files for bioinformatics applications on the MapReduce framework.…”

Section: Related Work (mentioning)

confidence: 99%

BitmapAligner: Bit-Parallelism String Matching with MapReduce and Hadoop

Aksa, Rashid, Nisar et al. 2021

Computers, Materials & Continua

Advancements in next-generation sequencer (NGS) platforms have improved NGS sequence data production and reduced the cost involved, which has resulted in the production of a large amount of genome data. The downstream analysis of multiple associated sequences has become a bottleneck for the growing genomic data due to storage and space utilization issues in the domain of bioinformatics. The traditional string-matching algorithms are efficient for small-sized data sequences and cannot process large amounts of data for downstream analysis. This study proposes a novel bit-parallelism algorithm called BitmapAligner to overcome the issues faced due to a large number of sequences and to improve the speed and quality of multiple sequence alignment (MSA). The input files (sequences) tested over BitmapAligner can be easily managed and organized using the Hadoop distributed file system. The proposed aligner converts the test file (the whole genome sequence) into binaries of an equal length of the sequence, line by line, before the sequence alignment processing. The Hadoop distributed file system splits the larger files into blocks, based on a defined block size, which is 128 MB by default. BitmapAligner can accurately process the sequence alignment using the bitmask approach on large-scale sequences after sorting the data. The experimental results indicate that BitmapAligner operates in real time, with a large number of sequences. Moreover, BitmapAligner achieves the exact start and end positions of the pattern sequence to test the MSA application in the whole genome query sequence. The MSA's accuracy is verified by the bitmask indexing property of the bit-parallelism extended shifts (BXS) algorithm. The dynamic and exact approach of the BXS algorithm is implemented through the MapReduce function of Apache Hadoop. Conversely, the traditional seeds-and-extend approach faces the risk of errors while identifying the pattern sequences' positions. Moreover, the proposed model resolves the large-scale data challenges that are covered through MapReduce in the Hadoop framework. Hive, Yarn, HBase, Cassandra, and many other pertinent flavors
