RHEEMix in the data jungle: a cost-based optimizer for cross-platform systems

Kruse, Sebastian; Kaoudi, Zoi; Contreras-Rojas, Bertty; Chawla, Sanjay; Naumann, Felix; Quiané-Ruiz, Jorge

doi:10.1007/s00778-020-00612-x

Cited by 11 publications

(6 citation statements)

References 41 publications

(64 reference statements)


“…More recently, RHEEM [2] enhanced the concept by allowing a particular subtask of the workflow to be assigned to a specific platform, in order to minimize the overall cost. It also introduces a novel cost-based cross-platform optimizer [27] that finds the most efficient platform for a task and an executor that orchestrates tasks over different platforms with intermediate data movement. Thus, RHEEM can integrate data from different data stores (hence act as a polystore) by assigning different operators from the query plan to different engines, e.g., perform selections on base tables and associated joins at the RDBMS to exploit indexes, then ship intermediate data and perform other joins at Spark to exploit parallelism.…”

Section: Related Work (mentioning)

confidence: 99%

Parallel query processing in a polystore

Kranas, Kolev, Levchenko et al. 2021, Distrib Parallel Databases

The blooming of different data stores has made polystores a major topic in the cloud and big data landscape. As the amount of data grows rapidly, it becomes critical to exploit the inherent parallel processing capabilities of underlying data stores and data processing platforms. To fully achieve this, a polystore should: (i) preserve the expressivity of each data store's native query or scripting language and (ii) leverage a distributed architecture to enable parallel data integration, i.e. joins, on top of parallel retrieval of underlying partitioned datasets. In this paper, we address these points by: (i) using the polyglot approach of the CloudMdsQL query language that allows native queries to be expressed as inline scripts and combined with SQL statements for ad-hoc integration and (ii) incorporating the approach within the LeanXcale distributed query engine, thus allowing for native scripts to be processed in parallel at data store shards. In addition, (iii) efficient optimization techniques, such as bind join, can take place to improve the performance of selective joins. We evaluate the performance benefits of exploiting parallelism in combination with high expressivity and optimization through our experimental validation.
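The citation statement above describes RHEEM's optimizer as assigning each operator of a plan to a platform so that the overall cost, including intermediate data movement, is minimized. A minimal sketch of that idea for a linear pipeline follows. It is an illustration only, not RHEEM's actual optimizer, which is considerably more elaborate. The function name `assign_platforms`, the platform labels, and all cost figures are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_platforms(
    op_costs: List[Dict[str, float]],
    move_cost: Dict[Tuple[str, str], float],
) -> Tuple[float, List[str]]:
    """Pick one platform per operator of a linear pipeline, minimizing
    execution cost plus data-movement cost between consecutive operators."""
    # best[p] = (cheapest cost of a partial plan ending on platform p, that plan)
    best = {p: (c, [p]) for p, c in op_costs[0].items()}
    for costs in op_costs[1:]:
        nxt = {}
        for p, c in costs.items():
            # Staying on the same platform moves no data (cost 0).
            prev_cost, prev_plan = min(
                (
                    (cost + (0.0 if q == p else move_cost[(q, p)]), plan)
                    for q, (cost, plan) in best.items()
                ),
                key=lambda t: t[0],
            )
            nxt[p] = (prev_cost + c, prev_plan + [p])
        best = nxt
    return min(best.values(), key=lambda t: t[0])


# Hypothetical costs: selection, first join, second join.
op_costs = [
    {"rdbms": 1.0, "spark": 5.0},   # selection on a base table (index helps)
    {"rdbms": 2.0, "spark": 6.0},   # join associated with the base table
    {"rdbms": 20.0, "spark": 4.0},  # large join that benefits from parallelism
]
move_cost = {("rdbms", "spark"): 3.0, ("spark", "rdbms"): 3.0}

cost, plan = assign_platforms(op_costs, move_cost)
# cost == 10.0, plan == ["rdbms", "rdbms", "spark"]
```

With these made-up numbers, the cheapest plan runs the selection and the first join at the RDBMS, ships the intermediate result once, and runs the last join at Spark, which mirrors the example given in the citation statement.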


“…RheemLatin [4,22,24] is an extension from PigLatin [26]. Similar to ADIL, it has its native data models and grammars.…”

Section: Related Work 81 Polystore Languages (mentioning)

confidence: 99%

“…Some prior work [10,14,20,22,24] mainly focuses on integrating multiple data processing platforms such as Spark, Hadoop, GraphX to process heterogeneous data, however, they do not focus on polystore which involves multiple DBMSs, thus they will not be discussed in this sections.…”

Section: Polystore System (mentioning)

confidence: 99%


AWESOME: Empowering Scalable Data Science on Social Media Data with an Optimized Tri-Store Data System

Zheng, Dasgupta, Kumar et al. 2021, Preprint

Modern big data applications usually involve heterogeneous data sources and analytical functions, leading to increasing demand for polystore systems, especially analytical polystore systems. This paper presents AWESOME system along with a domain-specific language ADIL. ADIL is a powerful language which supports 1) native heterogeneous data models such as Corpus, Graph, and Relation; 2) a rich set of analytical functions; and 3) clear and rigorous semantics. AWESOME is an efficient tri-store middle-ware which 1) is built on the top of three heterogeneous DBMSs (Postgres, Solr, and Neo4j) and is easy to be extended to incorporate other systems; 2) supports the in-memory query engines and is equipped with analytical capability; 3) applies a cost model to efficiently execute workloads written in ADIL; 4) fully exploits machine resources to improve scalability. A set of experiments on real workloads demonstrate the capability, efficiency, and scalability of AWESOME.


“…As of today, it supports a variety of platforms: Spark, Flink, PostgreSQL, GraphX, Giraph, and its in-memory Java-based executor. Wayang originated from the Rheem project [3,13], is currently incubating in the Apache Software Foundation, and is used by several companies. In particular, Databloom, an AI startup, has been created around Wayang [2].…”

Section: Introduction (mentioning)

confidence: 99%

Apache Wayang: A Unified Data Analytics Framework

Beedkar, Contreras-Rojas, Gavriilidis et al. 2023, SIGMOD Rec. (self cite)

The large variety of specialized data processing platforms and the increased complexity of data analytics has led to the need for unifying data analytics within a single framework. Such a framework should free users from the burden of (i) choosing the right platform(s) and (ii) gluing code between the different parts of their pipelines. Apache Wayang (Incubating) is the only open-source framework that provides a systematic solution to unified data analytics by integrating multiple heterogeneous data processing platforms. It achieves that by decoupling applications from the underlying platforms and providing an optimizer so that users do not have to specify the platforms on which their pipeline should run. Wayang provides a unified view and processing model, effectively integrating the hodgepodge of heterogeneous platforms into a single framework with increased usability without sacrificing performance and total cost of ownership. In this paper, we present the architecture of Wayang, describe its main components, and give an outlook on future directions.
