Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data
DOI: 10.1145/1989323.1989426
A platform for scalable one-pass analytics using MapReduce

Abstract: Today's one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the data set to be fully loaded into the cluster before running analytical queries. This paper examines, from a systems standpoint…
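The batch character the abstract describes can be made concrete with a minimal word-count sketch (an illustration only, with hypothetical names; this is not the paper's system): the shuffle step groups every intermediate pair before any reducer runs, which is exactly the barrier one-pass analytics has to avoid.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (word, 1) for every word in the input.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group ALL intermediate values by key. In batch
    # MapReduce this is a barrier: no reducer starts until the
    # whole input has been mapped and grouped.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values only after the barrier.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
```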

Cited by 130 publications (96 citation statements); references 21 publications.
“…After generating a random separating plane (lines 18-19), it iteratively runs another map function and a reduce function to calculate a new gradient (lines 20-26). Here each call of this map function will create a new DenseVector object.…”
Section: Motivating Example
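The allocation problem this citation points at, a fresh DenseVector on every map call, can be sketched as follows; the DenseVector class, the doubling transform, and both map variants are hypothetical stand-ins, not the cited code:

```python
class DenseVector:
    """Hypothetical stand-in for the DenseVector in the cited example."""
    def __init__(self, size):
        self.values = [0.0] * size

def map_allocating(record, size=4):
    # Allocates a fresh DenseVector on every call, as in the cited
    # example: N map calls produce N short-lived heap objects.
    v = DenseVector(size)
    for i, x in enumerate(record):
        v.values[i] = x * 2.0
    return v.values[:len(record)]

_scratch = DenseVector(4)

def map_reusing(record):
    # Reuses one preallocated vector across calls, avoiding the
    # per-call allocation churn that pressures the garbage collector.
    for i, x in enumerate(record):
        _scratch.values[i] = x * 2.0
    return _scratch.values[:len(record)]
```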
“…Furthermore, to improve the performance of multi-stage and iterative computations, recently developed systems support caching of intermediate data in the main memory [29,32,38,39] and exploit eager combining and aggregating of data in the shuffling phases [22,31]. These techniques would generate massive long-living data objects in the heap, which usually stay in the memory for a significant portion of the job execution time.…”
Section: Introduction
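The "eager combining and aggregating of data in the shuffling phases" mentioned in the quote can be sketched as a hash-based combiner applied while map output is being produced (a minimal illustration with hypothetical names, not any particular system's implementation):

```python
def map_words(lines):
    # Map: emit (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield word, 1

def eager_combine(pairs, reduce_fn):
    # Fold each (key, value) pair into a hash table as soon as it is
    # emitted, instead of buffering every pair for the shuffle; this
    # shrinks shuffle volume and the number of live heap objects.
    acc = {}
    for key, value in pairs:
        acc[key] = reduce_fn(acc[key], value) if key in acc else value
    return acc

partial = eager_combine(map_words(["a b a", "b"]), lambda x, y: x + y)
```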
“…Support for incremental processing of new data is also studied in [71,72]. The motivation is to support one-pass analytics for applications that continuously generate new data.…”
Section: Avoiding Redundant Processing
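The incremental-processing idea in the quote, folding newly arrived data into existing results instead of recomputing from scratch, can be sketched as a running aggregate (a hypothetical minimal illustration, not the cited systems):

```python
class IncrementalCount:
    """Minimal running per-key count; hypothetical stand-in."""
    def __init__(self):
        self.state = {}

    def ingest(self, batch):
        # Fold only the newly arrived batch into the existing state;
        # earlier data is never reprocessed.
        for line in batch:
            for word in line.split():
                self.state[word] = self.state.get(word, 0) + 1
        return dict(self.state)

ic = IncrementalCount()
ic.ingest(["a b"])           # first batch
snapshot = ic.ingest(["a"])  # only the new batch is processed
```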
“…Join type:
- Map-Reduce-Merge [29]: N/A, N/A, N/A, N/A, N/A
- Map-Join-Reduce [58]: N/A, N/A, N/A, N/A, N/A
- Afrati et al. [5,6]: No, No, Hash-based, "share"-based, No
- Repartition join [18]: Yes, No, Hash-based, No, No
- Broadcast join [18]: Yes, No, Broadcast, Broadcast R, No
- Semi-join [18]: Yes, No, Broadcast, Broadcast, No
- Per-split semi-join [18]: Yes

Whether each system modifies Hadoop:
- Hadoop++ [36]: No, based on using UDFs
- HAIL [37]: Yes, changes the RecordReader and a few UDFs
- CoHadoop [41]: Yes, extends HDFS and adds metadata to NameNode
- Llama [74]: No, runs on top of Hadoop
- Cheetah [28]: No, runs on top of Hadoop
- RCFile [50]: No changes to Hadoop, implements certain interfaces
- CIF [44]: No changes to Hadoop core, leverages extensibility features
- Trojan layouts [59]: Yes, introduces Trojan HDFS (among others)
- MRShare [83]: Yes, modifies map outputs with tags and writes to multiple output files on the reduce side
- ReStore [40]: Yes, extends the JobControlCompiler of Pig
- Sharing scans [11]: Independent of system
- Silva et al. [95]: No, integrated into SCOPE
- Incoop [17]: Yes, new file system, contraction phase, and memoization-aware scheduler
- Li et al. [71,72]: Yes, modifies the internals of Hadoop by replacing key components
- Grover et al. [47]: Yes, introduces dynamic job and Input Provider
- EARL [67]: Yes, RecordReader and Reduce classes are modified, and simple extension to Hadoop to support dynamic input and efficient resampling
- Top-k queries [38]: Yes, changes data placement and builds statistics
- RanKloud [24]: Yes, integrates its execution engine into Hadoop and uses local B+Tree indexes
- HaLoop [22,23]: Yes, use of caching and changes to the scheduler
- MapReduce online [30]: Yes, communication between Map and Reduce, and to JobTracker and TaskTracker
- NOVA [85]: No, runs on top of Pig and Hadoop
- Twister [39]: Adopts an …”
Section: Join Type
“…We have found two categories of related work in this space: Incremental Processing. Significant effort has been made recently to extend the traditional MapReduce paradigm to break the barrier between the Map and the Reduce phases, allowing reducers to run on partial results from mappers (e.g., [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]). One of the key differences across these is the way the sorting phase is handled: Some apply it on partitions of the post-mapper or the pre-reducer stages, while others do so in chunks using a spill file.…”
Section: B. Incremental MapReduce
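Breaking the Map/Reduce barrier as the quote describes, i.e. letting the reduce side consume partial mapper output, can be sketched with a hash-based pipelined reduce (a toy illustration under assumed names; the cited systems differ in how they replace or chunk the sort):

```python
from collections import defaultdict

def mapper(chunk):
    # Map: emit (word, 1) pairs for one chunk of input.
    for line in chunk:
        for word in line.split():
            yield word, 1

def pipelined_reduce(chunks):
    # Consume mapper output chunk by chunk and update a hash table,
    # rather than waiting for a global sort of all map output; a
    # partial answer is available after every chunk.
    acc = defaultdict(int)
    for chunk in chunks:
        for key, value in mapper(chunk):
            acc[key] += value
        yield dict(acc)

results = list(pipelined_reduce([["a b"], ["a c"]]))
```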