Big Data in metagenomics: Apache Spark vs MPI

Abuín, José Manuel; Lopes, Nuno; Ferreira, Luís; Pena, Tomás F.; Schmidt, Bertil

doi:10.1371/journal.pone.0239741

Cited by 12 publications

(8 citation statements)

References 33 publications

“…Thus, access to high performance computing (HPC) clusters or cloud-based environments would facilitate the processing of metagenomics data [88]. There is a continuous introduction of new technologies and data types expected to be added to the current omics data types which indicates the growing importance of HPC and cloud-based services [82].…”

Section: Integrated Multi-omics Analyses Of Microbial Communities (mentioning)

confidence: 99%

Integrated multi-omics analyses of microbial communities: a review of the current state and future directions

Arıkan, Muth (2023). Mol. Omics

“…However, the efficiency of MPI-based parallel applications degrades when dealing with large data sets. Moreover, programming with MPI requires programmers to explicitly deal with the individual nodes' status and communication patterns [33]. Finally, failures in MPI are dealt with by using stop-and-restart checkpointing solutions [34].…”

Section: Related Work (mentioning)

confidence: 99%

A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines

Taamneh, Al-Hami, Bani-Salameh et al. (2021). Data

Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault-tolerant mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead, using such techniques, was negligible. Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance.

“…As a general-purpose framework, Spark has been widely used for many scientific applications and algorithms. However, there are examples from different areas such as linear algebra [44], genomics [45] or even data science [46] where Spark does not obtain the expected performance.…”

Section: Spark and HPC Applications (mentioning)

confidence: 99%

A unified framework to improve the interoperability between HPC and Big Data languages and programming models

Piñeiro, Pichel (2021). Preprint

One of the most important issues in the path to the convergence of HPC and Big Data is caused by the differences in their software stacks. Despite some research efforts, the interoperability between their programming models and languages is still limited. To deal with this problem we introduce a new computing framework called IgnisHPC, whose main objective is to unify the execution of Big Data and HPC workloads in the same framework. IgnisHPC has native support for multi-language applications using JVM and non-JVM-based languages. Since MPI was used as its backbone technology, IgnisHPC takes advantage of many communication models and network architectures. Moreover, MPI applications can be directly executed in an efficient way in the framework. The main consequence is that users could combine in the same multi-language code HPC tasks (using MPI) with Big Data tasks (using MapReduce operations). The experimental evaluation demonstrates the benefits of our proposal in terms of performance and productivity with respect to other frameworks such as Apache Spark. IgnisHPC is publicly available for the Big Data and HPC research community.
