Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data 2012
DOI: 10.1145/2213836.2213963

Optimizing analytic data flows for multiple execution engines

Abstract: Next generation business intelligence involves data flows that span different execution engines, contain complex functionality like data/text analytics, machine learning operations, and need to be optimized against various objectives. Creating correct analytic data flows in such an environment is a challenging task and is both labor-intensive and time-consuming. Optimizing these flows is currently an ad-hoc process where the result is largely dependent on the abilities and experience of the flow designer. Our p…

Citations: cited by 80 publications (75 citation statements)
References: 21 publications
“…Furthermore, having multiple data-intensive flows answering different requirements of end-users waiting for execution, the system requires an optimal schedule for running these data flows over the shared computational resources (e.g., shared, multi-tenant cluster), i.e., Flow Scheduler module. Lastly, the automatic optimization means must be also provided when deploying data flows, for selecting an optimal execution engine (e.g., [85,56]), as well as for providing the lower level, engine-specific, optimization of a data flow (i.e., the Flow Deployer module).…”
Section: Overall Discussion (mentioning)
confidence: 99%
“…While low data latency is desirable for ETL processes, due to limited time windows dedicated to the DW refreshment processes, in the next generation BI setting, having data-intensive flows with close to zero latency is a must. Other techniques include: choosing the optimal implementation for the flow operations [93], selecting the optimal execution engine for executing a data flow [85,56], data flow fragmentation and pipelining [52,86]. -Multi-flow.…”
Section: Optimization Input (mentioning)
confidence: 99%
“…In [12], [20], the problem of changing the type of the execution engine for each task while taking into account engine switching costs is tackled. Even if we treat different execution engines as different degrees of partitioning, the solutions in [12], [20] are inadequate for our bi-objective problem, while also the solutions in [20] cannot scale.…”
Section: Related Work (mentioning)
confidence: 99%
“…In [12], [20], the problem of changing the type of the execution engine for each task while taking into account engine switching costs is tackled. Even if we treat different execution engines as different degrees of partitioning, the solutions in [12], [20] are inadequate for our bi-objective problem, while also the solutions in [20] cannot scale. Also, there are several techniques that try to allocate a DAG to the appropriate number of resources, but, typically, they do not consider any aspect that can correspond to repartitioning overhead, e.g., [21], [22].…”
Section: Related Work (mentioning)
confidence: 99%
“…QoX [SWCD12] is a special kind of loosely-coupled multistore system, where queries are analytical data-driven workflows (or data flows) that integrate data from relational databases, and various execution engines such as MapReduce or Extract-Transform-Load (ETL) tools. A typical data flow may combine unstructured data (e.g.…”
Section: QoX (mentioning)
confidence: 99%