Abstract: Executing a large number of independent jobs, or jobs comprising a large number of tasks that perform minimal intertask communication, is a common requirement in many domains. Various technologies, ranging from classic job schedulers to the latest cloud technologies such as MapReduce, can be used to execute these "many-tasks" in parallel. In this paper, we present our experience in applying two cloud technologies, Apache Hadoop and Microsoft DryadLINQ, to two bioinformatics applications with the above characteristics. The applications are a pairwise Alu sequence alignment application and an EST (Expressed Sequence Tag) sequence assembly program. First, we compare the performance of these cloud technologies on the two applications, and also compare them with a traditional MPI implementation for one of the applications. Next, we analyze the effect of inhomogeneous data on the scheduling mechanisms of the cloud technologies. Finally, we compare the performance of the cloud technologies on virtual and non-virtual hardware platforms.