Same Queries, Different Data: Can We Predict Runtime Performance?

Popescu, Adrian; Ercegovac, Vuk; Balmin, Andrey; Branco, M. De Oliveira; Ailamaki, Anastasia

doi:10.1109/icdew.2012.66

Cited by 30 publications

(25 citation statements)

References 6 publications


“…It is impossible to evaluate HFSP in a real deployment and in the complete absence of estimation errors, since execution time of a given job in Hadoop varies at each run, according to complex and rather unpredictable system properties [12], [28]. To isolate the impact of errors on scheduling and sojourn time, we thus turn to our simulation results on SRVT which is a size-based scheduler with aging induced by fair sharing in virtual time.…”

Section: Estimation Errors and Sojourn Times (mentioning)

confidence: 99%


HFSP: Size-based scheduling for Hadoop

Pastorelli, Barbuzzi, Carra, et al. 2013

2013 IEEE International Conference on Big Data


Abstract: Size-based scheduling with aging has, for long, been recognized as an effective approach to guarantee fairness and near-optimal system response times. We present HFSP, a scheduler introducing this technique to a real, multi-server, complex and widely used system such as Hadoop. Size-based scheduling requires a priori job size information, which is not available in Hadoop: HFSP builds such knowledge by estimating it on-line during job execution. Our experiments, which are based on realistic workloads generated via a standard benchmarking suite, pinpoint at a significant decrease in system response times with respect to the widely used Hadoop Fair scheduler, and show that HFSP is largely tolerant to job size estimation errors.


“…Job Size Estimation: Various recent approaches [9]-[12] propose techniques to estimate query sizes in recurring jobs. Agarwal et al. [11] report that recurring jobs are around 40% of all those running in Bing's production servers.…”

Section: Fairness and QoS (mentioning)

confidence: 99%

HFSP: Size-based scheduling for Hadoop

Pastorelli, Barbuzzi, Carra, et al. 2013

2013 IEEE International Conference on Big Data


“…For running jobs we continue to refine our work estimates by extrapolating based on data from the completed tasks. All of this can be improved in the future, for example by incorporating the techniques in [19]. Better estimates should improve the quality of our FlowFlex scheduler.…”

Section: Cluster Experiments (mentioning)

confidence: 99%

FlowFlex: Malleable Scheduling for Flows of MapReduce Jobs

Nagarajan, Wolf, Balmin, et al. 2013

Middleware 2013

Self Cite


Abstract. We introduce FlowFlex, a highly generic and effective scheduler for flows of MapReduce jobs connected by precedence constraints. Such a flow can result, for example, from a single user-level Pig, Hive or Jaql query. Each flow is associated with an arbitrary function describing the cost incurred in completing the flow at a particular time. The overall objective is to minimize either the total cost (minisum) or the maximum cost (minimax) of the flows. Our contributions are both theoretical and practical. Theoretically, we advance the state of the art in malleable parallel scheduling with precedence constraints. We employ resource augmentation analysis to provide bicriteria approximation algorithms for both minisum and minimax objective functions. As corollaries, we obtain approximation algorithms for total weighted completion time (and thus average completion time and average stretch), and for maximum weighted completion time (and thus makespan and maximum stretch). Practically, the average case performance of the FlowFlex scheduler is excellent, significantly better than other approaches. Specifically, we demonstrate via extensive experiments the overall performance of FlowFlex relative to optimal and also relative to other, standard MapReduce scheduling schemes. All told, FlowFlex dramatically extends the capabilities of the earlier Flex scheduler for singleton MapReduce jobs while simultaneously providing a solid theoretical foundation for both.
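
The FlowFlex citation statement above mentions refining work estimates for running jobs by extrapolating from completed tasks. A minimal sketch of that idea follows. It assumes the remaining tasks take, on average, as long as the completed ones. The function names and that averaging rule are illustrative assumptions, not the FlowFlex implementation.

```python
def estimate_total_work(completed_task_seconds, total_tasks):
    """Extrapolate a job's total work (task-seconds) from its completed tasks.

    Assumption (hypothetical): unfinished tasks cost the mean of finished ones.
    """
    done = len(completed_task_seconds)
    if done == 0:
        raise ValueError("need at least one completed task to extrapolate")
    if total_tasks < done:
        raise ValueError("total_tasks cannot be smaller than completed tasks")
    mean_seconds = sum(completed_task_seconds) / done
    return mean_seconds * total_tasks


def estimate_remaining_work(completed_task_seconds, total_tasks):
    """Remaining work is the extrapolated total minus the work already done."""
    total = estimate_total_work(completed_task_seconds, total_tasks)
    return total - sum(completed_task_seconds)


if __name__ == "__main__":
    finished = [10.0, 20.0, 30.0]  # durations of three completed tasks
    print(estimate_total_work(finished, total_tasks=10))
    print(estimate_remaining_work(finished, total_tasks=10))
```

The estimate can be recomputed each time a task finishes, so it tightens as the job progresses.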


“…Our approach uses minimal statistics about the input datasets (e.g., tuple size and number of tuples), which are complemented with historical information about prior query executions (e.g., execution time). More details on the predictions module have been published previously [27].…”

Section: Flex Scheduler (mentioning)

confidence: 99%
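
The last statement above describes predicting runtime from minimal input statistics (tuple size, number of tuples) combined with execution times of prior runs. A rough sketch of one way this could work is a least-squares line fitted to past (input bytes, runtime) pairs. The linear model and all names here are assumptions for illustration, not the model from the cited paper.

```python
def fit_linear(history):
    """Fit runtime = intercept + slope * input_bytes by ordinary least squares.

    history: list of (input_bytes, runtime_seconds) from prior executions.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two prior executions")
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in history)
    if sxx == 0:
        raise ValueError("prior executions must differ in input size")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in history)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_runtime(model, tuple_size_bytes, num_tuples):
    """Predict runtime for a new dataset from its tuple size and tuple count."""
    slope, intercept = model
    return intercept + slope * tuple_size_bytes * num_tuples


if __name__ == "__main__":
    past_runs = [(100, 12.0), (200, 22.0), (300, 32.0)]  # hypothetical history
    model = fit_linear(past_runs)
    print(predict_runtime(model, tuple_size_bytes=10, num_tuples=40))
```

A single-feature line is the simplest choice; a real predictor would likely need per-operator features and a way to handle queries whose cost is not linear in input size.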