2020
DOI: 10.1007/s00778-020-00612-x

RHEEMix in the data jungle: a cost-based optimizer for cross-platform systems

Abstract: Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements. The optimizer allocates the subtasks of data analytic tasks to the most suitable platforms. Our main contributions are: (i) a mechanism based on graph transformations to explore alternative execution strategies; (ii) a novel graph-based approach to determine efficient data movement plans among subtasks and pla…

Cited by 11 publications (6 citation statements)
References 41 publications (64 reference statements)

“…More recently, RHEEM [2] enhanced the concept by allowing a particular subtask of the workflow to be assigned to a specific platform, in order to minimize the overall cost. It also introduces a novel cost-based cross-platform optimizer [27] that finds the most efficient platform for a task and an executor that orchestrates tasks over different platforms with intermediate data movement. Thus, RHEEM can integrate data from different data stores (hence act as a polystore) by assigning different operators from the query plan to different engines, e.g., perform selections on base tables and associated joins at the RDBMS to exploit indexes, then ship intermediate data and perform other joins at Spark to exploit parallelism.…”
Section: Related Work (mentioning)
confidence: 99%
“…RheemLatin [4,22,24] is an extension from PigLatin [26]. Similar to ADIL, it has its native data models and grammars.…”
Section: Related Work 81 Polystore Languages (mentioning)
confidence: 99%
“…Some prior work [10,14,20,22,24] mainly focuses on integrating multiple data processing platforms such as Spark, Hadoop, GraphX to process heterogeneous data, however, they do not focus on polystore which involves multiple DBMSs, thus they will not be discussed in this sections.…”
Section: Polystore System (mentioning)
confidence: 99%
“…As of today, it supports a variety of platforms: Spark, Flink, PostgreSQL, GraphX, Giraph, and its in-memory Java-based executor. Wayang originated from the Rheem project [3,13], is currently incubating in the Apache Software Foundation, and is used by several companies. In particular, Databloom, an AI startup, has been created around Wayang [2].…”
Section: Introduction (mentioning)
confidence: 99%