Interpreting the Data: Parallel Analysis with Sawzall

Pike, Rob; Dorward, Sean; Griesemer, Robert; Quinlan, Sean

doi:10.1155/2005/962135

Cited by 376 publications

(235 citation statements)

References 14 publications

Supporting

Mentioning

233

Contrasting

Unclassified


“…In [2] we compared Storacle with state-of-the-art off-the-shelf NoSQL and SQL data bases by using relevant benchmarks and taking into account the limitation of storage size and processing resources that may be present at machines in a substation. A format evaluation in [2] suggested the use of the Protocol Buffer format [15] as basis as it leads to required data size and retrieval time superior to other potential data bases in this use case. In [2] we further listed Cube, RRD4J, Cassandra, InfluxDB, neo4j and OpenTSDB and described why it is not recommended to use these already existing time series database systems for this use case.…”

Section: Storacle (mentioning)

confidence: 99%

Provisioning, deployment, and operation of smart grid applications on substation level

et al. 2016


The transition of classical power distribution grids towards actively operated smart grids locates new functionality into intelligent secondary substations. Increased computational power and newly attained communication infrastructure in thousands of secondary substations allow for the distributed realization of sophisticated functions, which were inconceivable a few years ago. These novel functions (e.g., voltage and reactive power control, distributed generation optimization or decentralized market interaction) can primarily be realized by software components operated on powerful automation devices located on secondary substation level. [...] is crucial and has a broad set of requirements. In this paper, we present a flexible and modular software ecosystem for automation devices of substations, which is able to handle these requirements. This ecosystem contains means for high performance data exchange and unification, automatic application provisioning and configuration functions, dependency management, and others. The application of the ecosystem is demonstrated in the context of a field operation example, which has been developed within an Austrian smart grid research project.


“…Many of the individual systems that comprise this infrastructure have been the subject of academic publications [3,4,5,6,7,8,9,10] and received considerable interest, since they demonstrate practical approaches that have been deployed in live production environments on very large scales.…”

Section: Data-intensive Computing (mentioning)

confidence: 99%

“…For example, Sawzall [10] is an interpreted language for data analysis that is specifically designed to be integrated with MapReduce as an underlying execution engine. A Sawzall program conceptually executes in parallel for every record in a data set, and may produce output by emitting records to any number of declared aggregators.…”

Section: Workflow Composition and High-level Languages (mentioning)

confidence: 99%
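The execution model described in the statement above (per-record logic that only emits values to aggregators) can be illustrated with a minimal sketch. It is written in Python rather than Sawzall syntax, and the names `SumTable`, `process_record`, and `run_sharded` are hypothetical, chosen only for illustration. It assumes the aggregator is a simple sum, whose partial results can be merged in any order, which is what lets records be processed independently.

```python
from collections import defaultdict


class SumTable:
    """Hypothetical aggregator: keeps a running sum per key."""

    def __init__(self):
        self.totals = defaultdict(float)

    def emit(self, key, value):
        self.totals[key] += value

    def merge(self, other):
        # Addition is commutative and associative, so partial tables
        # from different shards can be combined in any order.
        for key, value in other.totals.items():
            self.totals[key] += value


def process_record(record, count, total):
    """Per-record logic: sees exactly one record and only emits."""
    value = float(record)
    count.emit("all", 1)
    total.emit("all", value)


def run_sharded(shards):
    """Process each shard with its own tables, then merge the results.

    The shards are handled sequentially here; an engine such as
    MapReduce would run them on separate machines.
    """
    final_count, final_total = SumTable(), SumTable()
    for shard in shards:
        count, total = SumTable(), SumTable()
        for record in shard:
            process_record(record, count, total)
        final_count.merge(count)
        final_total.merge(total)
    return final_count.totals["all"], final_total.totals["all"]


n, s = run_sharded([["1.5", "2.5"], ["4.0"]])
```

Because `process_record` never reads state produced by another record, splitting the input differently would not change the merged result for this aggregator.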

Cogset: a high performance MapReduce engine

Valvåg, Johansen, Kvalnes 2012

Concurrency and Computation


MapReduce has become a widely employed programming model for large-scale data-intensive computations. Traditional MapReduce engines employ dynamic routing of data as a core mechanism for fault tolerance and load balancing. An alternative mechanism is static routing, which reduces the need to store temporary copies of intermediate data, but requires a tighter coupling between the components for storage and processing. The initial intuition motivating our work is that reading and writing less temporary data could improve performance, while the tight coupling of storage and processing could be leveraged to improve data locality. We therefore conjecture that a high-performance MapReduce engine can be based on static routing, while preserving the non-functional properties associated with traditional engines. To investigate this thesis, we design, implement, and experiment with Cogset, a distributed MapReduce engine that deviates considerably from the traditional design. We evaluate the performance of Cogset by comparing it to a widely used traditional MapReduce engine using a previously established benchmark. The results confirm our thesis that a high-performance MapReduce engine can be based on static routing, although analysis indicates that the reasons for Cogset's performance improvements are more subtle than expected. Through our work we develop a better understanding of static routing, its benefits and limitations, and its ramifications for a MapReduce engine. A secondary goal of our work is to explore how higher-level abstractions that are commonly built on top of MapReduce will interact with an execution engine based on static routing. Cogset is therefore designed with a generic, low-level core interface, upon which MapReduce is implemented as a relatively thin layer, as one of several supported programming interfaces. At its core, Cogset provides a few fundamental mechanisms for reliable and distributed storage of data, and parallel processing of statically partitioned data. While this dissertation mainly focuses on how these capabilities are leveraged to implement a distributed MapReduce engine, we also demonstrate how two other higher-level abstractions were built on top of Cogset. These may serve as alternative access points for data-intensive applications, and illustrate how some of the lessons learned from Cogset can be applicable in a broader context.


“…Several distributed job execution engines have been proposed [5,4,15,25], and several high-level job description languages have been defined [7,16,26,27,28]. However, complex scientific analysis tasks are only just beginning to be ported to these new platforms.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster

Kwon, Nunley, Gardner et al. 2010

Lecture Notes in Computer Science


Abstract. Scientists' ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability rather than limitations in the availability of data have become the bottleneck to scientific discovery. MapReduce-style platforms hold the promise to address this growing data analysis problem, but it is not easy to express many scientific analyses in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset comprised of 906 million points, and show that in an 8-node cluster, our algorithm can process even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.
