2017
DOI: 10.1109/tpds.2016.2587645

iShuffle: Improving Hadoop Performance with Shuffle-on-Write

Cited by 81 publications (48 citation statements)
References 21 publications
“…Within this framework, data shuffling often appears to limit the performance of distributed computing applications, including self-join [6], tera-sort [7], and machine learning algorithms [8]. For example, in Facebook's Hadoop cluster, it is observed that 33% of the overall job execution time is spent on data shuffling [8].…”
mentioning
confidence: 99%
“…However, this approach relies on the RDMA feature of the InfiniBand network, which is not available on commodity network hardware. iShuffle [19] proposed an independent shuffle service for multi-tenant Hadoop clusters. It decouples shuffle and reduce, so that shuffle can be performed without running reduce tasks.…”
Section: Related Work
mentioning
confidence: 99%
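
The decoupling this statement describes can be pictured with a small simulation: map output is pushed toward its reducer as each spill is written, rather than pulled after reduce tasks start. The following is a minimal Python sketch of the shuffle-on-write idea only; the names (push_spill, REDUCER_NODES, run_mapper) are illustrative assumptions, not iShuffle's actual interfaces.

from collections import defaultdict

NUM_REDUCERS = 3
# Hypothetical stand-in for remote reducer nodes: partition id -> received records.
REDUCER_NODES = defaultdict(list)

def partition(key):
    # Hash-partition a key to a reducer, in the spirit of Hadoop's HashPartitioner.
    return hash(key) % NUM_REDUCERS

def push_spill(spill):
    # Shuffle-on-write: ship each partition of a finished spill immediately,
    # so data movement overlaps with the remaining map work instead of
    # waiting for reduce tasks to pull it later.
    for part, records in spill.items():
        REDUCER_NODES[part].extend(records)

def run_mapper(records, spill_threshold=4):
    spill = defaultdict(list)
    buffered = 0
    for key, value in records:
        spill[partition(key)].append((key, value))
        buffered += 1
        if buffered >= spill_threshold:
            push_spill(spill)              # push as the spill is written
            spill, buffered = defaultdict(list), 0
    push_spill(spill)                      # flush the final partial spill

run_mapper([("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 3)])
for part in sorted(REDUCER_NODES):
    print(f"reducer {part} holds {REDUCER_NODES[part]} before any reduce task runs")

The point of the sketch is only the ordering: every record reaches its reducer-side buffer while the map phase is still running, which is what lets shuffle time hide behind map time.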
“…One of the most popular cloud computing platforms is Hadoop, an open source MapReduce implementation for processing large datasets. In the Hadoop context, the shuffle phase is the process of transferring data from mappers to reducers, which becomes the bottleneck in large jobs [6]. In the item-based CF algorithm, computing the similarity matrix for items and calculating the prediction matrix for users are the most resource-intensive operations; thus, reducing the intermediate data during the shuffle phase can provide a substantial performance gain.…”
Section: Introduction
mentioning
confidence: 99%
“…In the Hadoop context, the shuffle phase is the process of transferring data from mappers to reducers, which becomes the bottleneck in large jobs [6]. In the item-based CF algorithm, computing the similarity matrix for items and calculating the prediction matrix for users are the most resource-intensive operations; thus, reducing the intermediate data during the shuffle phase can provide a substantial performance gain. In this paper, we propose an optimized MapReduce for the item-based CF algorithm integrated with empirical factors.…”
Section: Introduction
mentioning
confidence: 99%
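
The intermediate-data reduction these excerpts point to is commonly achieved with a map-side combiner: co-occurrence counts are pre-aggregated before they cross the network. The sketch below, in the same spirit as the one above, shows that effect on a toy item-based CF workload; the pairing scheme and all names (map_cooccurrences, combine) are illustrative assumptions, not the cited paper's exact algorithm.

from collections import Counter
from itertools import combinations

def map_cooccurrences(user_items):
    # Map step: emit one ((item_i, item_j), 1) record per co-rated pair,
    # the raw input to an item-item similarity computation.
    for pair in combinations(sorted(user_items), 2):
        yield pair, 1

def combine(mapped):
    # Combiner: sum counts locally, so each distinct pair crosses the
    # network once per mapper instead of once per occurrence.
    counts = Counter()
    for pair, one in mapped:
        counts[pair] += one
    return counts

users = [["milk", "bread", "eggs"], ["milk", "bread"], ["bread", "eggs"]]
raw = [record for items in users for record in map_cooccurrences(items)]
combined = combine(raw)
print(len(raw), "shuffle records without a combiner,", len(combined), "with one")

On real rating data the gap grows with item popularity, since a hot item pair is emitted once per co-rating user but shipped only once per mapper after combining.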