Incorporating partitioning and parallel plans into the SCOPE optimizer

Zhou, Jingren; Larson, Per-Åke; Chaiken, Ronnie

doi:10.1109/icde.2010.5447802

Cited by 76 publications

(55 citation statements)

References 13 publications

“…It extends the work presented in [44] by introducing parallelization techniques for UDFs. An UDF is treated as a black-box operation.…”

Section: Parallelism In Traditional Data Flow (mentioning)

confidence: 89%

“…The scheduling policies are proposed to optimize the ETL workflow with respect to execution time and memory consumption. In the literature, there exist multiple methods that revolve around data flow parallelism [34,36,40,43,44,45]. However, research on an ETL workflow parallelism has not appealed much consideration.…”

Section: ETL Workflow Optimization: Summary (mentioning)

confidence: 99%

“…The SCOPE optimizer introduces considerable parallelism based on a cost function. As presented in [44], the SCOPE optimizer generates a large number of execution plans by taking into account structural properties of data (e.g., partitioning, sorting, or grouping). The generated execution plans are then pruned using a cost model.…”

Section: Parallelism In Traditional Data Flow (mentioning)

confidence: 99%

“…A script (with annotations) similar to SQLScript [46] is used to express complex data flows containing ROs and UDFs together. The main goal is to parallelize ROs and UDFs together, which is achieved by directly translating a RO into the internal representation of the proposed cost-based optimizer as described in [44] and by applying the 'Worker-Farm' pattern [47] on an UDF. A complete set of annotations is described in [45].…”

Section: Parallelism In Traditional Data Flow (mentioning)

confidence: 99%

From conceptual design to performance optimization of ETL workflows: current state of research and open problems

Ali; Wrembel. The VLDB Journal, 2017

In this paper, we discuss the state of the art and current trends in designing and optimizing ETL workflows. We explain the existing techniques for: (1) constructing a conceptual and a logical model of an ETL workflow, (2) its corresponding physical implementation, and (3) its optimization, illustrated by examples. The discussed techniques are analyzed w.r.t. their advantages, disadvantages, and challenges in the context of metrics such as autonomous behavior, support for quality metrics, and support for ETL activities as user-defined functions. We draw conclusions on still open research and technological issues in the field of ETL. Finally, we propose a theoretical ETL framework for ETL optimization.

“…In a distributed environment, an additional dimension is introduced into the join taxonomy: the join graph topology. Graph topologies specify how different partitions of data are processed in a distributed way, and is affected by the following factors [25]:…”

Section: Join Processing (mentioning)

confidence: 99%

Advanced join strategies for large-scale distributed computation

Bruno; Kwon. Proc. VLDB Endow., 2014

Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets (e.g., search logs, click streams, and web graph data). For cost and performance reasons, processing is typically done on large clusters of thousands of commodity machines by using high level scripting languages. In the recent past, there has been significant progress in adapting well-known techniques from traditional relational DBMSs to this new scenario. However, important challenges remain open. In this paper we study the very common join operation, discuss some unique challenges in the large-scale distributed scenario, and explain how to efficiently and robustly process joins in a distributed way. Specifically, we introduce novel execution strategies that leverage opportunities not available in centralized scenarios, and others that robustly handle data skew. We report experimental validations of our approaches on Scope production clusters, which power the Applications and Services Group at Microsoft.

The Family of Map-Reduce

Sakr; Liu. Large-Scale Data Analytics, 2013
