Analyzing large datasets with distributed dataflow systems requires the use of clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. However, picking appropriate resources in both type and number is often challenging, as the selected configuration needs to match a distributed dataflow job's resource demands and access patterns. A good cluster configuration avoids hardware bottlenecks and maximizes resource utilization, thereby preventing costly overprovisioning.

We propose a collaborative approach for finding optimal cluster configurations based on sharing and learning from historical runtime data of distributed dataflow jobs. Collaboratively shared data can be used to predict the runtimes of future job executions with specialized regression models. However, training prediction models on historical runtime data produced by different users and in diverse contexts requires the models to take these contexts into account.
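As a minimal illustration of this idea, the following Python sketch trains a regression model on shared historical runs, with aspects of the execution context entering as ordinary features. The feature set, the synthetic data, and the choice of model are assumptions made for illustration, not the paper's implementation.

```python
# Sketch: predicting the runtime of a distributed dataflow job from
# collaboratively shared historical runs (illustrative, not the paper's code).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical shared history: each row describes one past execution.
# Columns: scale-out (node count), input size in GB, machine memory in GB.
X_history = np.array([
    [4,  100,  32],
    [8,  100,  32],
    [4,  200,  64],
    [16, 400,  64],
    [8,  200, 128],
])
runtimes_s = np.array([620.0, 340.0, 690.0, 710.0, 385.0])  # observed runtimes

# A regression model trained on the shared data; the execution context
# (here: the machine type's memory) is simply another input feature.
model = GradientBoostingRegressor().fit(X_history, runtimes_s)

# Predict the runtime of a candidate cluster configuration for a new run.
candidate = np.array([[8, 400, 64]])  # 8 nodes, 400 GB input, 64 GB memory
print(f"predicted runtime: {model.predict(candidate)[0]:.0f} s")
```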
Many scientific workflow scheduling algorithms need to be informed about task runtimes a priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem is aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible, as logs are typically not retained indefinitely and both workloads and infrastructures change. Online methods, in contrast, which predict task runtimes on specific nodes while the workflow is running, have to cope with a lack of example runs, especially during start-up.

In this paper, we present Lotaru, a novel online method for locally estimating task runtimes in scientific workflows on heterogeneous clusters. Lotaru first profiles all nodes of a cluster with a set of short-running and uniform microbenchmarks. Next, it runs the workflow to be scheduled on the user's local machine with drastically reduced data to determine important task characteristics. Based on these measurements, Lotaru learns a Bayesian linear regression model to predict a task's runtime given the input size, and finally adjusts the predicted runtime for each task-node pair in the cluster based on the microbenchmark results. Due to its Bayesian approach, Lotaru can also compute robust uncertainty estimates and provide them as input for advanced scheduling methods.

Our evaluation with five real-world scientific workflows and different datasets shows that Lotaru significantly outperforms the baselines in terms of prediction error on both homogeneous and heterogeneous clusters.
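The following Python sketch illustrates these steps under simplifying assumptions: the node names, benchmark scores, and local measurements are hypothetical, and scikit-learn's BayesianRidge stands in for the Bayesian linear regression; it is not Lotaru's actual implementation.

```python
# Sketch of the Lotaru idea (illustrative assumptions, not the authors' code).
import numpy as np
from sklearn.linear_model import BayesianRidge

# Step 1 (assumed values): microbenchmark scores per node, normalized so the
# user's local machine scores 1.0; higher means faster for this task type.
node_speedup = {"local": 1.0, "node-a": 1.8, "node-b": 0.9}

# Step 2: runs of one task on the local machine with drastically reduced data.
input_gb = np.array([[0.5], [1.0], [2.0], [4.0]])  # downsampled input sizes
runtime_s = np.array([12.0, 22.5, 44.0, 90.0])     # measured local runtimes

# Step 3: Bayesian linear regression of runtime on input size; the Bayesian
# treatment yields a predictive standard deviation as an uncertainty estimate.
model = BayesianRidge().fit(input_gb, runtime_s)

# Step 4: predict for the full-size input, then adjust the estimate for each
# task-node pair using the node's benchmark ratio to the local machine.
full_input = np.array([[50.0]])
mean_local, std_local = model.predict(full_input, return_std=True)
for node, speedup in node_speedup.items():
    est = mean_local[0] / speedup  # faster node -> shorter predicted runtime
    unc = std_local[0] / speedup
    print(f"{node}: {est:.0f} s +/- {unc:.0f} s")
```

The per-node adjustment in step 4 is a simple proportional scaling by the benchmark ratio; the uncertainty estimate is what advanced schedulers could consume as an additional input.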
CCS CONCEPTS
• Information systems → Information systems applications; • Computer systems organization → Distributed architectures; • Software and its engineering → Software architectures.