2015
DOI: 10.14778/2809974.2809979

Supporting scalable analytics with latency constraints

Abstract: Recently there has been significant interest in building big data analytics systems that can handle both "big data" and "fast data". Our work is strongly motivated by recent real-world use cases that point to the need for a general, unified data processing framework to support analytical queries with different latency requirements. Toward this goal, we start with an analysis of existing big data systems to understand the causes of high latency. We then propose an extended architecture with mini-batches as gr…

Cited by 32 publications (23 citation statements)
References 33 publications (47 reference statements)

“…For each dataflow program, we model each user objective as a function over all tunable parameters of the runtime system. Learning such a model for each user objective and a specific cluster environment has the potential to adapt to different objectives, hardware, and software characteristics, while static models [2,3,6] often fail to adapt due to hard-coded function shapes and constants.…”
Section: Key Techniques (mentioning)
confidence: 99%
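
The learned-model approach described in this citation can be sketched concretely. Below is a minimal illustration, assuming hypothetical tunable parameters (parallelism, batch interval, memory) and synthetic measurements; the actual objectives, parameters, and model family are not specified by the citing paper.

# A minimal sketch of the idea above: learn one model per user objective
# (e.g., latency, throughput) as a function of the runtime's tunable
# parameters, rather than relying on a static, hard-coded formula.
# Parameter names and training data here are synthetic and illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical knobs per configuration: [parallelism, batch_interval_ms, memory_gb]
configs = rng.uniform([1, 100, 1], [64, 5000, 32], size=(200, 3))

# Synthetic per-configuration measurements of two objectives.
latency_ms = 50 + configs[:, 1] / configs[:, 0] + rng.normal(0, 5, 200)
throughput = 1e3 * configs[:, 0] / (1 + configs[:, 1] / 1e3) + rng.normal(0, 50, 200)

# One learned model per objective, all over the same parameter space.
models = {
    "latency_ms": GradientBoostingRegressor().fit(configs, latency_ms),
    "throughput": GradientBoostingRegressor().fit(configs, throughput),
}

candidate = np.array([[16, 500, 8]])  # a configuration to evaluate
for objective, model in models.items():
    print(objective, float(model.predict(candidate)[0]))
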
“…(a) Big-Bench (TPCx-BB) for batch analytics includes 30 workloads, which can be divided into 14 SQL tasks, 11 SQL tasks with UDFs, and 5 ML workloads. (b) We also designed a new stream benchmark by extending previous workloads on click-stream analysis [3] to include stream SQL queries, SQL+UDF queries, and machine learning tasks. As suggested by our industry collaborators, these workloads have been parameterized in different ways to control the similarity among workloads.…”
Section: Demonstration (mentioning)
confidence: 99%
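
As an illustration of the parameterization mentioned above, a click-stream workload could expose knobs such as window length and a filter threshold; varying them controls how similar two generated queries are. The query template and parameter names below are hypothetical, not taken from the benchmark itself.

# Hypothetical, simplified example of a parameterized stream SQL workload.
WINDOWED_CLICK_QUERY = """
SELECT page_id, COUNT(*) AS clicks
FROM clicks
GROUP BY TUMBLE(event_time, INTERVAL '{window_s}' SECOND), page_id
HAVING COUNT(*) > {min_clicks}
"""

def make_workload(window_s, min_clicks):
    # Each (window_s, min_clicks) pair yields one workload instance; the
    # distance between parameter values controls workload similarity.
    return WINDOWED_CLICK_QUERY.format(window_s=window_s, min_clicks=min_clicks)

print(make_workload(window_s=10, min_clicks=5))
print(make_workload(window_s=60, min_clicks=100))
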
“…Twitter Heron [38] performs user-defined thread allocation and mapping via the Aurora scheduler. The paper [42] proposes an analytical model for resource allocation and dynamic mapping to meet latency requirements while maximizing throughput for processing real-time streams on Hadoop. Stela [72] uses effective throughput percentage (ETP) as the metric to decide which task to scale when the user requests scaling in/out with a given number of machines.…”
Section: Scheduling for DSPs (mentioning)
confidence: 99%
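
The ETP metric referenced here can be sketched in simplified form: an operator's effective throughput percentage is the share of total sink throughput it can still deliver through non-congested downstream operators. The topology and numbers below are hypothetical, and the actual metric in Stela [72] is more involved.

# Simplified, illustrative sketch of effective throughput percentage (ETP).
def etp(op, edges, sink_throughput, congested):
    """edges: {operator: [downstream operators]}; operators with no outgoing
    edges are sinks. Congested downstream operators block a path's contribution."""
    total = sum(sink_throughput.values())

    def reachable(o, is_root):
        if not is_root and o in congested:   # path blocked by congestion
            return 0.0
        if not edges.get(o):                 # sink: contributes its own throughput
            return sink_throughput.get(o, 0.0)
        return sum(reachable(d, False) for d in edges[o])

    return reachable(op, True) / total

edges = {"spout": ["A", "B"], "A": ["sink1"], "B": ["sink2"]}
sink_throughput = {"sink1": 6000.0, "sink2": 4000.0}
print(etp("spout", edges, sink_throughput, congested={"B"}))  # 0.6
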
“…An analytic model for computing the latencies of MapReduce tasks is presented in [25]. [26] proposes a stochastic cost model for generic workflow tasks but does not consider different degrees of parallelism.…”
Section: Related Work (mentioning)
confidence: 99%
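
For context on what such an analytic latency model looks like, here is a back-of-envelope, wave-based estimate for a MapReduce job. This is not the model from [25]; its structure and all parameter values are illustrative.

# Back-of-envelope latency estimate for a MapReduce job using "waves":
# with S slots, N tasks complete in ceil(N / S) sequential rounds.
import math

def mapreduce_latency_s(num_maps, num_reduces, map_slots, reduce_slots,
                        avg_map_s, avg_shuffle_s, avg_reduce_s):
    map_waves = math.ceil(num_maps / map_slots)
    reduce_waves = math.ceil(num_reduces / reduce_slots)
    return map_waves * avg_map_s + avg_shuffle_s + reduce_waves * avg_reduce_s

# Example: 400 map tasks on 100 slots, 50 reduce tasks on 25 slots.
print(mapreduce_latency_s(400, 50, 100, 25,
                          avg_map_s=30, avg_shuffle_s=20, avg_reduce_s=60))
# -> 4 * 30 + 20 + 2 * 60 = 260 seconds
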