Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data
DOI: 10.1145/1989323.1989426
A platform for scalable one-pass analytics using MapReduce

Abstract: Today's one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the data set to be fully loaded into the cluster before running analytical queries. This paper examines, from a systems standpoint…
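The batch character the abstract describes can be made concrete with a minimal word-count sketch (an illustration only, with hypothetical names; this is not the paper's system): the shuffle step groups every intermediate pair before any reducer runs, which is exactly the barrier one-pass analytics has to avoid.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (word, 1) for every word in the input.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group ALL intermediate values by key. In batch
    # MapReduce this is a barrier: no reducer starts until the
    # whole input has been mapped and grouped.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values only after the barrier.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
```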

Cited by 130 publications (96 citation statements); references 21 publications.
“…After generating a random separating plane (lines 18-19), it iteratively runs another map function and a reduce function to calculate a new gradient (lines 20-26). Here each call of this map function will create a new DenseVector object.…”
Section: Motivating Example
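The allocation problem this citation points at, a fresh DenseVector on every map call, can be sketched as follows; the DenseVector class, the doubling transform, and both map variants are hypothetical stand-ins, not the cited code:

```python
class DenseVector:
    """Hypothetical stand-in for the DenseVector in the cited example."""
    def __init__(self, size):
        self.values = [0.0] * size

def map_allocating(record, size=4):
    # Allocates a fresh DenseVector on every call, as in the cited
    # example: N map calls produce N short-lived heap objects.
    v = DenseVector(size)
    for i, x in enumerate(record):
        v.values[i] = x * 2.0
    return v.values[:len(record)]

_scratch = DenseVector(4)

def map_reusing(record):
    # Reuses one preallocated vector across calls, avoiding the
    # per-call allocation churn that pressures the garbage collector.
    for i, x in enumerate(record):
        _scratch.values[i] = x * 2.0
    return _scratch.values[:len(record)]
```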
“…Furthermore, to improve the performance of multi-stage and iterative computations, recently developed systems support caching of intermediate data in the main memory [29,32,38,39] and exploit eager combining and aggregating of data in the shuffling phases [22,31]. These techniques would generate massive long-living data objects in the heap, which usually stay in the memory for a significant portion of the job execution time.…”
Section: Introduction
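The "eager combining and aggregating of data in the shuffling phases" mentioned in the quote can be sketched as a hash-based combiner applied while map output is being produced (a minimal illustration with hypothetical names, not any particular system's implementation):

```python
def map_words(lines):
    # Map: emit (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield word, 1

def eager_combine(pairs, reduce_fn):
    # Fold each (key, value) pair into a hash table as soon as it is
    # emitted, instead of buffering every pair for the shuffle; this
    # shrinks shuffle volume and the number of live heap objects.
    acc = {}
    for key, value in pairs:
        acc[key] = reduce_fn(acc[key], value) if key in acc else value
    return acc

partial = eager_combine(map_words(["a b a", "b"]), lambda x, y: x + y)
```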
“…Support for incremental processing of new data is also studied in [71,72]. The motivation is to support one-pass analytics for applications that continuously generate new data.…”
Section: Avoiding Redundant Processing
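The incremental-processing idea in the quote, folding newly arrived data into existing results instead of recomputing from scratch, can be sketched as a running aggregate (a hypothetical minimal illustration, not the cited systems):

```python
class IncrementalCount:
    """Minimal running per-key count; hypothetical stand-in."""
    def __init__(self):
        self.state = {}

    def ingest(self, batch):
        # Fold only the newly arrived batch into the existing state;
        # earlier data is never reprocessed.
        for line in batch:
            for word in line.split():
                self.state[word] = self.state.get(word, 0) + 1
        return dict(self.state)

ic = IncrementalCount()
ic.ingest(["a b"])           # first batch
snapshot = ic.ingest(["a"])  # only the new batch is processed
```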
“…Join type:
- Map-Reduce-Merge [29]: N/A, N/A, N/A, N/A, N/A
- Map-Join-Reduce [58]: N/A, N/A, N/A, N/A, N/A
- Afrati et al. [5,6]: No, No, Hash-based, "share"-based, No
- Repartition join [18]: Yes, No, Hash-based, No, No
- Broadcast join [18]: Yes, No, Broadcast, Broadcast R, No
- Semi-join [18]: Yes, No, Broadcast, Broadcast, No
- Per-split semi-join [18]: Yes

Whether each system modifies Hadoop:
- Hadoop++ [36]: No, based on using UDFs
- HAIL [37]: Yes, changes the RecordReader and a few UDFs
- CoHadoop [41]: Yes, extends HDFS and adds metadata to NameNode
- Llama [74]: No, runs on top of Hadoop
- Cheetah [28]: No, runs on top of Hadoop
- RCFile [50]: No changes to Hadoop, implements certain interfaces
- CIF [44]: No changes to Hadoop core, leverages extensibility features
- Trojan layouts [59]: Yes, introduces Trojan HDFS (among others)
- MRShare [83]: Yes, modifies map outputs with tags and writes to multiple output files on the reduce side
- ReStore [40]: Yes, extends the JobControlCompiler of Pig
- Sharing scans [11]: Independent of system
- Silva et al. [95]: No, integrated into SCOPE
- Incoop [17]: Yes, new file system, contraction phase, and memoization-aware scheduler
- Li et al. [71,72]: Yes, modifies the internals of Hadoop by replacing key components
- Grover et al. [47]: Yes, introduces dynamic job and Input Provider
- EARL [67]: Yes, RecordReader and Reduce classes are modified, and simple extension to Hadoop to support dynamic input and efficient resampling
- Top-k queries [38]: Yes, changes data placement and builds statistics
- RanKloud [24]: Yes, integrates its execution engine into Hadoop and uses local B+Tree indexes
- HaLoop [22,23]: Yes, use of caching and changes to the scheduler
- MapReduce online [30]: Yes, communication between Map and Reduce, and to JobTracker and TaskTracker
- NOVA [85]: No, runs on top of Pig and Hadoop
- Twister [39]: Adopts an …”
Section: Join Type
“…We have found two categories of related work in this space: Incremental Processing. Significant effort has been made recently to extend the traditional MapReduce paradigm to break the barrier between the Map and the Reduce phases, allowing reducers to run on partial results from mappers (e.g., [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]). One of the key differences across these is the way the sorting phase is handled: Some apply it on partitions of the post-mapper or the pre-reducer stages, while others do so in chunks using a spill file.…”
Section: B. Incremental MapReduce
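Breaking the Map/Reduce barrier as the quote describes, i.e. letting the reduce side consume partial mapper output, can be sketched with a hash-based pipelined reduce (a toy illustration under assumed names; the cited systems differ in how they replace or chunk the sort):

```python
from collections import defaultdict

def mapper(chunk):
    # Map: emit (word, 1) pairs for one chunk of input.
    for line in chunk:
        for word in line.split():
            yield word, 1

def pipelined_reduce(chunks):
    # Consume mapper output chunk by chunk and update a hash table,
    # rather than waiting for a global sort of all map output; a
    # partial answer is available after every chunk.
    acc = defaultdict(int)
    for chunk in chunks:
        for key, value in mapper(chunk):
            acc[key] += value
        yield dict(acc)

results = list(pipelined_reduce([["a b"], ["a c"]]))
```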