2015 IEEE 31st International Conference on Data Engineering
DOI: 10.1109/icde.2015.7113338

Automatic tuning of bag-of-tasks applications

Cited by 8 publications (7 citation statements)
References 34 publications
“…A natural choice to this end is to employ sampling, e.g., as in [13], [14]. However, sampling-based automated profile generation seems to be a particularly challenging task in Spark.…”
Section: Discussion On The Provision Of End-to-end Solutions
confidence: 99%
“…However, all these cost modeling and profiling techniques do not cover specific phenomena in Spark execution, such as super-linear speed-ups for small degrees of parallelism and performance degradation for large ones. The proposals in [13], [14] present a sampling-based approach to estimate the profile of a single embarrassingly parallel task, based on the behavior of some of its partitions. However, they assume that partitions are scheduled in multiple waves, whereas we have adopted a configuration, where all partitions are scheduled in a single wave but there are multiple interdependent tasks.…”
Section: Related Work
confidence: 99%
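
To make the wave-based estimate described above concrete, here is a minimal Python sketch of the sampling idea attributed to [13], [14]. The names `run_partition`, `partitions`, and `num_workers` are hypothetical stand-ins introduced for illustration, not the actual interface of those proposals: a few randomly sampled partitions are timed, and the makespan is extrapolated assuming the remaining partitions are scheduled in waves.

```python
import math
import random
import time

def estimate_runtime(run_partition, partitions, num_workers, sample_size=5):
    """Estimate the makespan of an embarrassingly parallel task by timing
    a random sample of its partitions (hypothetical interface)."""
    sample = random.sample(partitions, min(sample_size, len(partitions)))
    timings = []
    for p in sample:
        start = time.perf_counter()
        run_partition(p)                      # execute one sampled partition
        timings.append(time.perf_counter() - start)
    mean_time = sum(timings) / len(timings)
    # Partitions run in waves of num_workers at a time, so the estimated
    # makespan is (number of waves) x (mean per-partition time).
    waves = math.ceil(len(partitions) / num_workers)
    return waves * mean_time
```

Note that this wave model is exactly the assumption the citing authors push back on: with a single wave of interdependent tasks, per-partition sampling alone no longer determines the makespan.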
“…Since the motifs search space is a combinatorial tree, it is logically partitioned into many sub-trees. In analytics workload, the number of sub-trees affects the utilization of the computing resources [20]. On a supercomputer, the StarQL optimizer estimates the query workload using a sampling technique and determines that 2,048 cores can be fully utilized.…”
Section: Parallel Support For StarQL Operations
confidence: 99%
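
As a rough illustration of how a sampled workload estimate can bound the number of usable cores, consider the sketch below. It is not StarQL's optimizer; `subtrees`, `explore`, and the `min_busy_seconds` knob are assumptions introduced for the example.

```python
import random
import time

def cores_fully_utilized(subtrees, explore, max_cores,
                         sample_size=32, min_busy_seconds=1.0):
    """Estimate total work from a random sample of sub-trees, then return
    the largest core count that keeps every core meaningfully busy."""
    sample = random.sample(subtrees, min(sample_size, len(subtrees)))
    timings = []
    for tree in sample:
        start = time.perf_counter()
        explore(tree)                         # search one sampled sub-tree
        timings.append(time.perf_counter() - start)
    total_work = (sum(timings) / len(timings)) * len(subtrees)
    # Utilization is capped both by the decomposition (at most one busy
    # core per sub-tree) and by giving each core enough work to amortize
    # scheduling overhead (min_busy_seconds is an assumed threshold).
    return min(max_cores, len(subtrees),
               max(1, int(total_work / min_busy_seconds)))
```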
“…In order to utilize large infrastructures, it is critical to find the best decomposition and to accurately estimate runtimes. StarDB adopts our automatic tuning framework [8] to decide the problem decomposition and estimate serial and parallel runtimes. Random sample tasks are used to model the workload of different decompositions.…”
Section: Indexing and Large-scale Parallelism
confidence: 99%
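
The decomposition choice described in this statement can be sketched in a few lines of Python. This only approximates the idea behind the cited tuning framework [8]; the `decompositions` mapping (name to task list) and `run_task` are hypothetical names introduced here.

```python
import random
import time

def pick_decomposition(decompositions, run_task, num_cores, sample_size=8):
    """For each candidate decomposition, time a few randomly sampled tasks,
    extrapolate serial and parallel runtimes, and keep the fastest."""
    best_name, best_parallel = None, float("inf")
    for name, tasks in decompositions.items():
        sample = random.sample(tasks, min(sample_size, len(tasks)))
        timings = []
        for task in sample:
            start = time.perf_counter()
            run_task(task)                    # execute one sampled task
            timings.append(time.perf_counter() - start)
        mean_time = sum(timings) / len(timings)
        serial = mean_time * len(tasks)       # one core runs everything
        parallel = serial / min(num_cores, len(tasks))
        if parallel < best_parallel:
            best_name, best_parallel = name, parallel
    return best_name, best_parallel
```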
“…StarDB uses our novel data structures [9] and parallel string algorithms [10] to natively facilitate large-scale analytics for strings. We incorporate our automatic tuning framework for large infrastructures [8] to meet users' time and budget constraints. StarDB allows users to easily form complex string queries.…”
Section: Introduction
confidence: 99%