Optimizing analytic data flows for multiple execution engines

Simitsis, Alkis; Wilkinson, Kevin; Castellanos, Malú; Dayal, Umeshwar

doi:10.1145/2213836.2213963

Cited by 80 publications

(75 citation statements)

References 21 publications


“…Furthermore, having multiple data-intensive flows answering different requirements of end-users waiting for execution, the system requires an optimal schedule for running these data flows over the shared computational resources (e.g., shared, multi-tenant cluster), i.e., Flow Scheduler module. Lastly, the automatic optimization means must be also provided when deploying data flows, for selecting an optimal execution engine (e.g., [85,56]), as well as for providing the lower level, engine-specific, optimization of a data flow (i.e., the Flow Deployer module).…”

Section: Overall Discussion (mentioning)

confidence: 99%

“…While low data latency is desirable for ETL processes, due to limited time windows dedicated to the DW refreshment processes, in the next generation BI setting, having data-intensive flows with close to zero latency is a must. Other techniques include: choosing the optimal implementation for the flow operations [93], selecting the optimal execution engine for executing a data flow [85,56], data flow fragmentation and pipelining [52,86]. -Multi-flow.…”

Section: Optimization Input (mentioning)

confidence: 99%


A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey

Jovanovic

Romero

Abelló

2016

Lecture Notes in Computer Science


Abstract. Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet complex requirements of next generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next generation BI environments. In this paper we present a survey of today's research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that still are to be addressed, and how the current solutions can be applied for addressing these challenges.

“…In [12], [20], the problem of changing the type of the execution engine for each task while taking into account engine switching costs is tackled. Even if we treat different execution engines as different degrees of partitioning, the solutions in [12], [20] are inadequate for our bi-objective problem, while also the solutions in [20] cannot scale. Also, there are several techniques that try to allocate a DAG to the appropriate number of resources, but, typically, they do not consider any aspect that can correspond to repartitioning overhead, e.g., [21], [22].…”

Section: Related Work (mentioning)

confidence: 99%

Dynamic Configuration of Partitioning in Spark Applications

Gounaris

Kougka

Tous

et al. 2017

IEEE Trans. Parallel Distrib. Syst.


Abstract. Spark has become one of the main options for large-scale analytics running on top of shared-nothing clusters. This work aims to make a deep dive into the parallelism configuration and shed light on the behavior of parallel spark jobs. It is motivated by the fact that running a Spark application on all the available processors does not necessarily imply lower running time, while may entail waste of resources. We first propose analytical models for expressing the running time as a function of the number of machines employed. We then take another step, namely to present novel algorithms for configuring dynamic partitioning with a view to minimizing resource consumption without sacrificing running time beyond a user-defined limit. The problem we target is NP-hard. To tackle it, we propose a greedy approach after introducing the notions of dependency graphs and of the benefit from modifying the degree of partitioning at a stage; complementarily, we investigate a randomized approach. Our polynomial solutions are capable of judiciously use the resources that are potentially at user's disposal and strike interesting trade-offs between running time and resource consumption. Their efficiency is thoroughly investigated through experiments based on real execution data.


“…QoX [SWCD12] is a special kind of loosely-coupled multistore system, where queries are analytical data-driven workflows (or data flows) that integrate data from relational databases, and various execution engines such as MapReduce or Extract-Transform-Load (ETL) tools. A typical data flow may combine unstructured data (e.g.…”

Section: QoX (mentioning)

confidence: 99%

Query processing in multistore systems: an overview

Bondiombouy

Valduriez

2016

IJCC


Building cloud data-intensive applications often requires using multiple data stores (NoSQL, HDFS, RDBMS, etc.), each optimised for one kind of data and tasks. However, the wide diversification of data store interfaces makes it difficult to access and integrate data from multiple data stores. This important problem has motivated the design of a new generation of systems, called multistore systems, which provide integrated or transparent access to a number of cloud data stores through one or more query languages. In this paper, we give an overview of query processing in multistore systems. We start by introducing the recent cloud data management solutions and query processing in multidatabase systems. Then, we describe and analyse some representative multistore systems, based on their architecture, data model, query languages and query processing techniques. To ease comparison, we divide multistore systems based on the level of coupling with the underlying data stores, i.e., loosely-coupled, tightly-coupled and hybrid. Our analysis reveals some important trends, which we discuss. We also identify some major research issues.

