2015
DOI: 10.1093/bioinformatics/btv553

cl-dash: rapid configuration and deployment of Hadoop clusters for bioinformatics research in the cloud

Abstract: Summary: One of the solutions proposed for addressing the challenge of the overwhelming abundance of genomic sequence and other biological data is the use of the Hadoop computing framework. Appropriate tools are needed to set up computational environments that facilitate research of novel bioinformatics methodology using Hadoop. Here, we present cl-dash, a complete starter kit for setting up such an environment. Configuring and deploying new Hadoop clusters can be done in minutes. Use of Amazon Web Services en…

Citations: cited by 11 publications (3 citation statements)
References: 6 publications
“…The new spectra-cluster algorithm was specifically developed using the Apache Hadoop framework [13, 14] to reach two main goals: 1) to increase spectrum clustering accuracy and 2) to be scalable to handle the exponential data increase in PRIDE Archive. To increase spectrum clustering accuracy, based on the proportion of incorrectly clustered spectra, we developed a novel method to assess the similarity between two spectra: instead of the commonly used normalized dot product we employed a probabilistic scoring approach similar to that of the spectrum library search engine Pepitome [15] (Online Methods).…”
Section: Results (mentioning)
confidence: 99%
“…Therefore, we have important significance for the study of job scheduling algorithms. The task scheduling algorithm is considered a complex process because it must make full use of the available resources to perform a large number of tasks [11]. It simplifies the file consistency model when it is stored.…”
Section: Introduction (mentioning)
confidence: 99%
“…Distributed computing has mainly been selected as the method for cloud computing. With the development of grid computing, computation on the cloud by Apache Hadoop has been conducted extensively [3,5,7,8] and support tools for constructing Hadoop clusters on the cloud have been established [17]. However, while Hadoop/MapReduce can easily construct a distributed task calculation environment, it is versatile and therefore contains an excessive amount of functions.…”
Section: Introduction (mentioning)
confidence: 99%