Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Gufler, Benjamin; Augsten, Nikolaus; Reiser, Angelika; Kemper, Alfons

doi:10.1109/icde.2012.58

Cited by 96 publications

(47 citation statements)

References 13 publications


“…Then, the remaining keys (keygroups) of running tasks are tried to redistribute so that the capacity of the idle nodes is utilized. The approach in [5] is similar to our previous load balancing work [12] as it also relies on cardinality estimates determined during the map phase of the computation.…”

Section: Related Work (mentioning)

confidence: 99%


When to Reach for the Cloud: Using Parallel Hardware for Link Discovery

Ngomo, Kolb, Heino, et al. 2013

The Semantic Web: Semantics and Big Data


Abstract. With the ever-growing amount of RDF data available across the Web, the discovery of links between datasets and deduplication of resources within knowledge bases have become tasks of crucial importance. Over the last years, several link discovery approaches have been developed to tackle the runtime and complexity problems that are intrinsic to link discovery. Yet, so far, little attention has been paid to the management of hardware resources for the execution of link discovery tasks. This paper addresses this research gap by investigating the efficient use of hardware resources for link discovery. We implement the HR³ approach for three different parallel processing paradigms including the use of GPUs and MapReduce platforms. We also perform a thorough performance comparison for these implementations. Our results show that certain tasks that appear to require cloud computing techniques can actually be accomplished using standard parallel hardware. Moreover, our evaluation provides break-even points that can serve as guidelines for deciding on when to use which hardware for link discovery.


“…In this work we use OpenCL, a vendor-agnostic industry standard. The memory model as exposed to OpenCL kernels is depicted in Figure 2: An instance of a compute kernel running on a device is called a work item or simply thread. Work items are combined into work groups.…”

Section: General-purpose Computing on GPUs (mentioning)

confidence: 99%

When to Reach for the Cloud: Using Parallel Hardware for Link Discovery

Ngomo, Kolb, Heino, et al. 2013

The Semantic Web: Semantics and Big Data


“…Then, the remaining keys (keygroups) of running tasks are tried to redistribute so that the capacity of the idle nodes is utilized. The approach in [7] is similar to our previous load balancing work [13] as it also relies on cardinality estimates determined during the map phase of the computation. This study as well as SkewTune are not focusing on entity resolution and cannot handle skew problems introduced by dominating blocks or key groups that need to be distributed among several reduce tasks.…”

Section: Related Work (mentioning)

confidence: 99%

“…Load balancing and skew handling are well-known problems for parallel data processing but have only recently gained attention for MapReduce [21,18,19,7]. [21] presents a theoretical analysis of skew effects for MR but focuses on linear processing of entities in the reduce phase while ER has quadratic complexity to compare entities with each other.…”

Section: Related Work (mentioning)

confidence: 99%

Parallel Entity Resolution with Dedoop

Kolb, Rahm 2012

Datenbank Spektrum


We provide an overview of Dedoop (Deduplication with Hadoop), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.


“…Gufler et al [48] study the problem of handling data skew by means of an adaptive load balancing strategy. A cost estimation method is proposed to quantify the cost of the work assigned to reduce tasks, in order to ensure that this is performed fairly.…”

Section: Repartitioning (mentioning)

confidence: 99%

A survey of large-scale analytical query processing in MapReduce

2013


Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of data that need to be extracted, processed, and analyzed in a timely fashion. Arguably the most popular framework for contemporary large-scale data analytics is MapReduce, mainly due to its salient features that include scalability, fault-tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in miscellaneous analytical tasks, and this has given rise to a significant body of research that aims at improving its efficiency, while maintaining its desirable properties. This survey aims to review the state-of-the-art in improving the performance of parallel query processing using MapReduce. A set of the most significant weaknesses and limitations of MapReduce is discussed at a high level, along with solving techniques. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target. Based on the proposed taxonomy, a clas-C. Doulkeridis
