Sharing across Multiple MapReduce Jobs

Nykiel, Tomasz; Potamias, Michalis; Mishra, Chaitanya; Kollios, George; Koudas, Nick

doi:10.1145/2560796

Cited by 61 publications

(122 citation statements)

References 43 publications

Supporting

Mentioning: 120

Contrasting

Unclassified

“…MRShare [83] is a sharing framework that identifies different queries (jobs) that share portions of identical work. Such queries do not need to be recomputed each time from scratch.…”

Section: Avoiding Redundant Processing (mentioning)

confidence: 99%

“…Map-Reduce-Merge [29] N/A N/A N/A N/A N/A
Map-Join-Reduce [58] N/A N/A N/A N/A N/A
Afrati et al [5,6] No No Hash-based "share"-based No
Repartition join [18] Yes No Hash-based No No
Broadcast join [18] Yes No Broadcast Broadcast R No
Semi-join [18] Yes No Broadcast Broadcast No
Per-split semi-join [18] Yes
Hadoop++ [36] No, based on using UDFs
HAIL [37] Yes, changes the RecordReader and a few UDFs
CoHadoop [41] Yes, extends HDFS and adds metadata to NameNode
Llama [74] No, runs on top of Hadoop
Cheetah [28] No, runs on top of Hadoop
RCFile [50] No changes to Hadoop, implements certain interfaces
CIF [44] No changes to Hadoop core, leverages extensibility features
Trojan layouts [59] Yes, introduces Trojan HDFS (among others)
MRShare [83] Yes, modifies map outputs with tags and writes to multiple output files on the reduce side
ReStore [40] Yes, extends the JobControlCompiler of Pig
Sharing scans [11] Independent of system
Silva et al [95] No, integrated into SCOPE
Incoop [17] Yes, new file system, contraction phase, and memoization-aware scheduler
Li et al [71,72] Yes, modifies the internals of Hadoop by replacing key components
Grover et al [47] Yes, introduces dynamic job and Input Provider
EARL [67] Yes, RecordReader and Reduce classes are modified, and simple extension to Hadoop to support dynamic input and efficient resampling
Top-k queries [38] Yes, changes data placement and builds statistics
RanKloud [24] Yes, integrates its execution engine into Hadoop and uses local B+Tree indexes
HaLoop [22,23] Yes, use of caching and changes to the scheduler
MapReduce online [30] Yes, communication between Map and Reduce, and to JobTracker and TaskTracker
NOVA [85] No, runs on top of Pig and Hadoop
Twister [39] Adopts an ...…”

Section: Join Type (mentioning)

confidence: 99%

A survey of large-scale analytical query processing in MapReduce

2013

Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of data that need to be extracted, processed, and analyzed in a timely fashion. Arguably the most popular framework for contemporary large-scale data analytics is MapReduce, mainly due to its salient features that include scalability, fault-tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in miscellaneous analytical tasks, and this has given rise to a significant body of research that aim at improving its efficiency, while maintaining its desirable properties. This survey aims to review the state-of-the-art in improving the performance of parallel query processing using MapReduce. A set of the most significant weaknesses and limitations of MapReduce is discussed at a high level, along with solving techniques. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target. Based on the proposed taxonomy, a clas…

C. Doulkeridis

“…The declarative property of these languages also open up new opportunities for automatic optimization in the framework [12,3,11]. Since different jobs (specified in or translated from queries) often perform similar work (e.g., jobs scanning the same input file or producing some shared map output), there are many opportunities to exploit the shared processing among the jobs to optimize performance.…”

Section: Introduction (mentioning)

confidence: 99%

“…Since different jobs (specified in or translated from queries) often perform similar work (e.g., jobs scanning the same input file or producing some shared map output), there are many opportunities to exploit the shared processing among the jobs to optimize performance. As noted by several researchers [13,12], it is useful to apply the ideas from multi-query optimization to optimize the processing of multiple jobs by avoiding redundant computation in the MapReduce framework.…”

Section: Introduction (mentioning)

confidence: 99%

“…The state-of-the-art work in this direction is MRShare [12] which has proposed two sharing techniques for a batch of jobs. The share map input scan technique aims to share the scan of the input file for jobs, while the share map output technique aims to reduce the communication cost for map output tuples by generating only one copy of each shared map output tuple.…”

Section: Introduction (mentioning)

confidence: 99%
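
The two sharing techniques named in the statement above can be illustrated with a toy, in-memory sketch. This is not the MRShare implementation and does not use Hadoop: the function names (`merged_map`, `run_shared`), the job format, and the sample records are all hypothetical. It assumes map functions emit hashable `(key, value)` pairs. Each record is read once for all jobs (shared scan), and a map output tuple produced by several jobs is emitted once, tagged with the jobs that need it (shared map output).

```python
from collections import Counter, defaultdict


def merged_map(record, map_fns):
    """Run every job's map function on one record and merge identical outputs.

    Yields (key, value, tags), where tags is the set of job ids that
    produced that (key, value) pair for this record.
    """
    per_job = {job: Counter(fn(record)) for job, fn in map_fns.items()}
    distinct = set().union(*per_job.values())
    for kv in distinct:
        most = max(counts[kv] for counts in per_job.values())
        # Emit one tagged copy per occurrence, so multiplicities are preserved.
        for i in range(most):
            tags = frozenset(j for j, counts in per_job.items() if counts[kv] > i)
            yield kv[0], kv[1], tags


def run_shared(records, jobs):
    """jobs maps a job id to a (map_fn, reduce_fn) pair (hypothetical format)."""
    map_fns = {job: fns[0] for job, fns in jobs.items()}
    groups = {job: defaultdict(list) for job in jobs}
    shuffled = 0  # number of tuples that would cross the network
    for record in records:  # single scan of the input for all jobs
        for key, value, tags in merged_map(record, map_fns):
            shuffled += 1
            for job in tags:  # reduce side routes the tuple by its tags
                groups[job][key].append(value)
    results = {
        job: {key: jobs[job][1](key, values) for key, values in groups[job].items()}
        for job in jobs
    }
    return results, shuffled


def parse(line):
    """Hypothetical map function: 'user,bytes' -> (user, bytes)."""
    user, size = line.split(",")
    yield user, int(size)


# Two illustrative jobs with the same map output but different reducers.
jobs = {
    "total": (parse, lambda key, values: sum(values)),
    "max": (parse, lambda key, values: max(values)),
}
records = ["ann,10", "bob,5", "ann,7"]
results, shuffled = run_shared(records, jobs)
```

In this example the two jobs produce identical map output, so three tagged tuples are shuffled instead of the six that two independent jobs would send. Whether merging pays off on a real cluster depends on costs this sketch ignores, such as sorting and I/O for the larger merged map output.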

Multi-query optimization in MapReduce framework

Wang

Chan

2013

Proc. VLDB Endow.

MapReduce has recently emerged as a new paradigm for large-scale data analysis due to its high scalability, fine-grained fault tolerance and easy programming model. Since different jobs often share similar work (e.g., several jobs scan the same input file or produce the same map output), there are many opportunities to optimize the performance for a batch of jobs. In this paper, we propose two new techniques for multi-job optimization in the MapReduce framework. The first is a generalized grouping technique (which generalizes the recently proposed MRShare technique) that merges multiple jobs into a single job thereby enabling the merged jobs to share both the scan of the input file as well as the communication of the common map output. The second is a materialization technique that enables multiple jobs to share both the scan of the input file as well as the communication of the common map output via partial materialization of the map output of some jobs (in the map and/or reduce phase). Our second contribution is the proposal of a new optimization algorithm that given an input batch of jobs, produces an optimal plan by a judicious partitioning of the jobs into groups and an optimal assignment of the processing technique to each group. Our experimental results on Hadoop demonstrate that our new approach significantly outperforms the state-of-the-art technique, MRShare, by up to 107%.

Modeling and optimizing MapReduce programs

Dörre

Apel

Lengauer

2014

Concurrency and Computation

MapReduce frameworks allow programmers to write distributed, data-parallel programs that operate on multisets. These frameworks offer considerable flexibility to support various kinds of programs and data. To understand the essence of the programming model better and to provide a rigorous foundation for optimizations, we present an abstract, functional model of MapReduce along with a number of customization options. We demonstrate that the MapReduce programming model can also represent programs that operate on lists, which differ from multisets in that the order of elements matters. Along with the functional model, we offer a cost model that allows programmers to estimate and compare the performance of MapReduce programs. Based on the cost model, we introduce two transformation rules aiming at performance optimization of MapReduce programs, which also demonstrates the usefulness of our model. In an exploratory study, we assess the impact of applying these rules to two applications. The functional model and the cost model provide insights at a proper level of abstraction into why the optimization works.
