Job scheduling for a MapReduce cluster has been an active research topic in recent years. However, measurement traces from real-world production environment show that the duration of tasks within a job vary widely. The overall elapsed time of a job, i.e. the so-called flowtime, is often dictated by one or few slowly-running tasks within a job, generally referred as the "stragglers". The cause of stragglers include tasks running on partially/intermittently failing machines or the existence of some localized resource bottleneck(s) within a MapReduce cluster. To tackle this online job scheduling challenge, we adopt the task cloning approach and design the corresponding scheduling algorithms which aim at minimizing the weighted sum of job flowtimes in a MapReduce cluster based on the Shortest Remaining Processing Time scheduler (SRPT). To be more specific, we first design a 2-competitive offline algorithm when the variance of task-duration is negligible. We then extend this offline algorithm to yield the so-called SRPTMS+C algorithm for the online case and show that SRPTMS+C is (1 + ) − speed o( 1 2 ) − competitive in reducing the weighted sum of job flowtimes within a cluster. Both of the algorithms explicitly consider the precedence constraints between the two phases within the MapReduce framework. We also demonstrate via trace-driven simulations that SRPTMS+C can significantly reduce the weighted/unweighted sum of job flowtimes by cutting down the elapsed time of small jobs substantially. In particular, SRPTMS+C beats the Microsoft Mantri scheme by nearly 25% according to this metric.
We present a design framework of regenerating codes for distributed storage systems which employ binary additions and bit-wise cyclic shifts as the basic operations. The proposed coding method can be regarded as a concatenation coding scheme with the outer code being a binary cyclic code, and the inner code a regenerating code utilizing the binary cyclic code as the alphabet set. The advantage of this approach is that encoding and repair of failed node can be done with low computational complexity. It is proved that the proposed coding method can achieve the fundamental tradeoff curve between the storage and repair bandwidth asymptotically when the size of the data file is large.
Nowadays, a computing cluster in a typical data center can easily consist of hundreds of thousands of commodity servers, making component/ machine failures the norm rather than exception. A parallel processing job can be delayed substantially as long as one of its many tasks is being assigned to a failing machine. To tackle this so-called straggler problem, most parallel processing frameworks such as MapReduce have adopted various strategies under which the system may speculatively launch additional copies of the same task if its progress is abnormally slow or simply because extra idling resource is available. In this paper, we focus on the design of speculative execution schemes for a parallel processing cluster under different loading conditions. For the lightly loaded case, we analyze and propose two optimization-based schemes, namely, the Smart Cloning Algorithm (SCA) which is based on maximizing the job utility and the Straggler Detection Algorithm (SDA) which minimizes the overall resource consumption of a job. We also derive the workload threshold under which SCA or SDA should be used for speculative execution. Our simulation results show both SCA and SDA can reduce the job flowtime by nearly 60% comparing to the speculative execution strategy of Microsoft Mantri. For the heavily loaded case, we propose the Enhanced Speculative Execution (ESE) algorithm which is an extension of the Microsoft Mantri scheme. We show that the ESE algorithm can beat the Mantri baseline scheme by 18% in terms of job flowtime while consuming the same amount of resource.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.