Evaluating cloud frameworks on genomic applications

Bertoni, Michele; Ceri, Stefano; Kaitoua, Abdurrahman; Pinoli, Pietro

doi:10.1109/bigdata.2015.7363756

Cited by 22 publications

(20 citation statements)

References 14 publications


“…Comparative analysis, published in [9] and [10], shows that the performance of Flink and Spark are remarkably similar, while the performance of Spark and SciDB are very different, with SciDB faster than Spark when operations involve selections and aggregates (as they are facilitated by an array organization); whereas, Spark is faster than SciDB in JOIN and MAP operations (thanks to the general power of the Spark execution engine).…”

Section: Discussion (mentioning)

confidence: 99%

Experiences in the Development of a Data Management System for Genomics

Ceri, Canakoglu, Kaitoua, et al. 2018

Communications in Computer and Information Science

Self Cite


GMQL is a high-level query language for genomics, which operates on datasets described through GDM, a unifying data model for processed data formats. They are ingredients for the integration of processed genomic datasets, i.e. of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today's genomic platforms is due to data extraction pipelines, we anticipate soon a shift of attention towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available. In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with the results of experimental data targeting individuals or small populations. While today's big data are raw reads of the sequencing machines, tomorrow's big data will also include billions or trillions of genomic regions, each featuring specific values depending on the processing conditions. Coherently, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems, from Pig to Flink to SciDB to Spark. In this paper, we discuss how the GMQL execution environment has been developed, by going through a major version change that marked a complete system redesign; we also discuss our experiences in comparatively evaluating the four platforms.


“…(i) In terms of processing speed, Apache Flink outperforms other resource management frameworks for small, medium, and large datasets [30,36]. However, during our own set of experiments on Amazon EC2 cluster with varied task managers settings (1-4 task managers per node), Flink failed to complete custom smaller size JVM dataset jobs due to inefficient memory management of Flink memory manager.…”

Section: Observations and Findings (mentioning)

confidence: 98%

“…The study results showed that Spark performed up to three times better than MapReduce for most of the cases. Bertoni et al [36] performed the experimental evaluation of Apache Flink and Storm using large genomic dataset data on Amazon EC2 cloud. Apache Flink was superior to Storm while performing histogram and map operations while Storm outperformed Flink while genomic join application was deployed.…”

Section: Processing Speed (mentioning)

confidence: 99%

Big Data in Cloud Computing: A Resource Management Perspective

Ullah, Awan, Khiyal 2018

Scientific Programming


The modern day advancement is increasingly digitizing our lives which has led to a rapid growth of data. Such multidimensional datasets are precious due to the potential of unearthing new knowledge and developing decision-making insights from them. Analyzing this huge amount of data from multiple sources can help organizations to plan for the future and anticipate changing market trends and customer requirements. While the Hadoop framework is a popular platform for processing larger datasets, there are a number of other computing infrastructures, available to use in various application domains. The primary focus of the study is how to classify major big data resource management systems in the context of cloud computing environment. We identify some key features which characterize big data frameworks as well as their associated challenges and issues. We use various evaluation metrics from different aspects to identify usage scenarios of these platforms. The study came up with some interesting findings which are in contradiction with the available literature on the Internet.


“…We are currently completing the cluster installation at CINECA, so we have not yet a full set of performance figures. However, in [5] we have deployed the architecture discussed in this section on the Amazon Web Services (AWS) cloud, using a configuration with m3.2xlarge machines, each with 8 virtual CPUs, 30 GB of memory, and 2 x 80 GB of SSD storage. The testing setup contained one driver node and three configurations of slave nodes, set at 10, 15, and 19 nodes respectively.…”

Section: Performance Testing (mentioning)

confidence: 99%

Scalable Genomic Data Management System on the Cloud

Kaitoua, Gulino, Masseroli, et al. 2017

2017 International Conference on High Performance Computing & Simulation (HPCS)

Self Cite


Thanks to the huge amount of sequenced data that is becoming available, building scalable solutions for supporting query processing and data analysis over genomics datasets is increasingly important. This paper presents GDMS, a scalable Genomic Data Management System for querying region-based genomic datasets; the focus of the paper is on the deployment of the system on a cluster hosted by CINECA.

