Modeling and optimizing MapReduce programs

Dörre, Jens; Apel, Sven; Lengauer, Christian

doi:10.1002/cpe.3333

Cited by 14 publications

(14 citation statements)

References 37 publications

“…With cost models, we can find performance bugs and optimize programs before execution or after little profiling. Such a cost model was also given by Dörre et al [10].…”

Section: One-pass Mapreduce Implementation For Scanbsp (mentioning)

confidence: 95%

“…Among several studies, two studies by Lämmel [15] and Dörre et al [10] gave detailed functional models of MapReduce computation.…”

Section: Differences From Previous Work (mentioning)

confidence: 99%

“…A cost model plays an important role in optimization: we can predict the performance of MapReduce programs before execution or with a little profiling of execution. On this topic, Dörre et al [10] wrote down the computation of Hadoop MapReduce and developed a cost model for MapReduce programs.…”

Section: Introduction (mentioning)

confidence: 99%

“…[12]) to obtain better programs from specifications. -Cost model We can also develop cost models [10] based on those functional models. A cost model plays an important role in optimization: we can predict the performance of MapReduce programs before execution or with a little profiling of execution.…”

Section: Introduction (mentioning)

confidence: 99%

Functional Models of Hadoop MapReduce with Application to Scan

Matsuzaki

2016

Int J Parallel Prog

MapReduce, first proposed by Google, is a remarkable programming model for processing very large amounts of data. An open-source implementation of MapReduce, called Hadoop, is now used for developing a wide range of applications. Although developing a correct and efficient program on MapReduce is much easier than developing one with MPI etc., it is still nontrivial if the target application requires involved functionalities of Hadoop MapReduce. Under these situations, functional models for MapReduce computation play important roles because we can utilize them for better understanding, proving the correctness, and even optimization of MapReduce programs. In this paper, we develop two functional models, a low-level one and a high-level one, which capture the semantics of Hadoop MapReduce computation. We discuss the detailed semantics mainly in terms of the following two computations: the computation of Mapper and Reducer classes and the computation in the Shuffle phase with the secondary-sorting technique. In addition, we develop MapReduce algorithms for the scan computational pattern (prefix sums) on the newly proposed models.

“…In the last decade, new technologies based on the cloud model, offered to many organizations the possibility to store and analyze their data in an efficient way and a timely manner, which help them uncover patterns, get insight and provide better services. Hadoop, is an open source framework that offers a distributed storage layer, HDFS [1], tightly coupled with a distributed processing engine, MapReduce [2]. Hadoop allows the partitioning of data and computation across clusters of thousands of machines, in such a way, that each machine compute its local or neighbor's data.…”

Section: Introduction (mentioning)

confidence: 99%

Optimizing Hadoop for Small File Management

Elmahouti, Achandair, Khoulji, et al.

2017

TMLAI

HDFS is one of the most used distributed file systems, that offer a high availability and scalability on low cost hardware. HDFS is delivered as the storage component of Hadoop framework. Coupled with map reduce, which is the processing component, HDFS and MapReduce become the de facto platform for managing big data nowadays. However, HDFS was designed to handle specifically a huge number of large files, while when it comes to a large number of small files, Hadoop deployments may be not efficient. In this paper, we proposed a new strategy to manage small files. Our approach consists of two principal phases. The first phase is about consolidating more than only one client's small files input, and store the inputs continuously in the first allocated block, in a SequenceFile format, and so on into the next blocks. That way we avoid multiple block allocations for different streams, to reduce calls for available blocks and to reduce the metadata memory on the NameNode. This is because groups of small files packaged in a SequenceFile on the same block will require one entry instead of one for each small file. The second phase consists of analyzing attributes of stored small files to distribute them in such a way that the most called files will be referenced by an additional index as a MapFile format to reduce the read throughput during random access.

iShare: Balancing I/O performance isolation and disk I/O efficiency in virtualized environments

Tao, Ling, et al.

2015

Concurrency and Computation

SUMMARY: Performance isolation has long been a challenging problem for disk resource allocation in virtualized environments. While there have been many researches working on I/O performance isolation and disk utilization, none of them addresses the I/O performance isolation and disk utilization as a whole. To this end, we investigate the impact of current disk I/O performance isolation schemes on disk I/O utilization. Interestingly, our studies report that current isolation schemes bring unnecessary disk idle and reduce the overall disk I/O performance due to ignoring the disk states and characteristics of requests. Accordingly, we propose an adaptive proportional-share I/O scheduling framework, named iShare, in virtualized environments. iShare not only ensures I/O performance isolation through proportionally allocating time slices according to the weights of VMs, but also preserves high disk efficiency by detecting disk states and adaptively adjusting the time slice size based on characteristics of requests. We implement a prototype of iShare on the Xen platform. The experimental results show that iShare ensures I/O performance isolation while improving disk I/O efficiency: compared with Blkio (i.e., the default I/O performance isolation method in Xen), iShare increases disk I/O bandwidth by 58% and slightly improves the I/O performance isolation for the sequential write applications.
