Sector and Sphere: the design and implementation of a high-performance data cloud

Gu, Yunhong; Grossman, Robert L.

doi:10.1098/rsta.2009.0053

Cited by 116 publications

(52 citation statements)

References 18 publications


“…An alternative approach is to fuse the distributed file system and processing engine into a single, tightly coupled component. This philosophy is characteristic of parallel databases, and is also embraced by others, for example in the twin systems Sector and Sphere [46]. These closely integrate the mechanisms for data processing with the storage layer, by offering the capability of evaluating user-defined functions locally on storage nodes.…”

Section: Alternative and Hybrid Architectures (mentioning)

confidence: 99%

Cogset: a high performance MapReduce engine

Valvåg

Johansen

Kvalnes

2012

Concurrency and Computation


MapReduce has become a widely employed programming model for large-scale data-intensive computations. Traditional MapReduce engines employ dynamic routing of data as a core mechanism for fault tolerance and load balancing. An alternative mechanism is static routing, which reduces the need to store temporary copies of intermediate data, but requires a tighter coupling between the components for storage and processing. The initial intuition motivating our work is that reading and writing less temporary data could improve performance, while the tight coupling of storage and processing could be leveraged to improve data locality.

We therefore conjecture that a high-performance MapReduce engine can be based on static routing, while preserving the non-functional properties associated with traditional engines. To investigate this thesis, we design, implement, and experiment with Cogset, a distributed MapReduce engine that deviates considerably from the traditional design.

We evaluate the performance of Cogset by comparing it to a widely used traditional MapReduce engine using a previously established benchmark. The results confirm our thesis that a high-performance MapReduce engine can be based on static routing, although analysis indicates that the reasons for Cogset's performance improvements are more subtle than expected. Through our work we develop a better understanding of static routing, its benefits and limitations, and its ramifications for a MapReduce engine.

A secondary goal of our work is to explore how higher-level abstractions that are commonly built on top of MapReduce will interact with an execution engine based on static routing. Cogset is therefore designed with a generic, low-level core interface, upon which MapReduce is implemented as a relatively thin layer, as one of several supported programming interfaces.

At its core, Cogset provides a few fundamental mechanisms for reliable and distributed storage of data, and parallel processing of statically partitioned data. While this dissertation mainly focuses on how these capabilities are leveraged to implement a distributed MapReduce engine, we also demonstrate how two other higher-level abstractions were built on top of Cogset. These may serve as alternative access points for data-intensive applications, and illustrate how some of the lessons learned from Cogset can be applicable in a broader context.
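The idea of static routing described in this abstract can be illustrated with a minimal sketch. This is not Cogset's code or interface: the function names, the node names, and the use of CRC32 as the partitioning hash are all assumptions made for illustration. The sketch only shows the defining property, namely that the destination of every record is fixed by its key and a table decided before the job starts, rather than chosen at run time.

```python
import zlib

# Hypothetical, fixed assignment of partitions to storage nodes.
# With static routing this table is decided up front and does not change mid-job.
PARTITION_OWNER = {0: "node-a", 1: "node-b", 2: "node-c"}


def static_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically.

    zlib.crc32 is used because it is stable across processes, unlike
    Python's built-in hash(), which is salted per interpreter run.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(records, owners=PARTITION_OWNER):
    """Group (key, value) records by the node that statically owns their key."""
    by_node = {node: [] for node in owners.values()}
    for key, value in records:
        partition = static_partition(key, len(owners))
        by_node[owners[partition]].append((key, value))
    return by_node


if __name__ == "__main__":
    sample = [("apple", 1), ("pear", 2), ("apple", 3), ("plum", 4)]
    for node, items in route(sample).items():
        print(node, items)
```

Because the mapping depends only on the key, all records with the same key arrive at the same node without any run-time scheduling decision, which is the property the abstract contrasts with dynamic routing.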


“…Recently, there have been a number of implementations of MapReduce and similar data processing tools [21,25,36,45,59,73,87]. Apache Hadoop was the most popular implementation of MapReduce at the start of the Magellan project and it continues to gain traction in various communities.…”

Section: Mapreduce Programming Model (mentioning)

confidence: 99%

The Magellan Final Report on Cloud Computing

Coghlan

2011


Executive Summary

The goal of Magellan, a project funded through the U.S. Department of Energy (DOE) Office of Advanced Scientific Computing Research (ASCR), was to investigate the potential role of cloud computing in addressing the computing needs for the DOE Office of Science (SC), particularly related to serving the needs of midrange computing and future data-intensive computing workloads. A set of research questions was formed to probe various aspects of cloud computing from performance, usability, and cost. To address these questions, a distributed testbed infrastructure was deployed at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Center (NERSC). The testbed was designed to be flexible and capable enough to explore a variety of computing models and hardware design points in order to understand the impact for various scientific applications. During the project, the testbed also served as a valuable resource to application scientists. Applications from a diverse set of projects such as MG-RAST (a metagenomics analysis server), the Joint Genome Institute, the STAR experiment at the Relativistic Heavy Ion Collider, and the Laser Interferometer Gravitational Wave Observatory (LIGO), were used by the Magellan project for benchmarking within the cloud, but the project teams were also able to accomplish important production science utilizing the Magellan cloud resources.

Cloud computing has garnered significant attention from both industry and research scientists as it has emerged as a potential model to address a broad array of computing needs and requirements such as custom software environments and increased utilization among others. Cloud services, both private and public, have demonstrated the ability to provide a scalable set of services that can be easily and cost-effectively utilized to tackle various enterprise and web workloads. These benefits are a direct result of the definition of cloud computing: on-demand self-service resources that are pooled, can be accessed via a network, and can be elastically adjusted by the user. The pooling of resources across a large user base enables economies of scale, while the ability to easily provision and elastically expand the resources provides flexible capabilities.

Following the Executive Summary we summarize the key findings and recommendations of the project. Greater detail is provided in the body of the report. Here we briefly summarize some of the high-level findings from the study.

• Cloud approaches provide many advantages, including customized environments that enable users to bring their own software stack and try out new computing environments without significant administration overhead, the ability to quickly surge resources to address larger problems, and the advantages that come from increased economies of scale. Virtualization is the primary strategy of providing these capabilities. Our experience working with application scientists using the cloud demonstrated the power of virtualization to enable ...


“…However, this post-processing phase can be very expensive since the output prior to filtering can become much larger than the final output; for instance, on the wiki-talk-3 graph the first enumeration phase takes 7 min (on 20 processors), and the second post-processing phase takes 228 min (on 80 processors). The algorithm is implemented for the Sector/Sphere [16] framework.…”

Section: Related Work (mentioning)

confidence: 99%

Mining maximal cliques from a large graph using MapReduce: Tackling highly uneven subproblem sizes

Svendsen

Mukherjee

Tirthapura

2015

Journal of Parallel and Distributed Computing


We consider Maximal Clique Enumeration (MCE) from a large graph. A maximal clique is perhaps the most fundamental dense substructure in a graph, and MCE is an important tool to discover densely connected subgraphs, with numerous applications to data mining on web graphs, social networks, and biological networks. While effective sequential methods for MCE are known, scalable parallel methods for MCE are still lacking.

We present a new parallel algorithm for MCE, Parallel Enumeration of Cliques using Ordering (PECO), designed for the MapReduce framework. Unlike previous works, which required a post-processing step to remove duplicate and non-maximal cliques, PECO enumerates only maximal cliques with no duplicates. The key technical ingredient is a total ordering of the vertices of the graph which is used in a novel way to achieve a load balanced distribution of work, and to eliminate redundant work among processors. We implemented PECO on Hadoop MapReduce, and our experiments on a cluster show that the algorithm can effectively process a variety of large real-world graphs with millions of vertices and tens of millions of maximal cliques, and scales well with the degree of available parallelism.

Keywords: Graph mining, Maximal clique enumeration, Enumeration algorithm, MapReduce, Hadoop, Parallel algorithm, Clique, Load balancing

Disciplines: Electrical and Computer Engineering

Comments: This is a manuscript of an article from Svendsen, Michael, Arko Provo Mukherjee, and Srikanta Tirthapura. "Mining maximal cliques from a large graph using mapreduce: Tackling highly uneven subproblem sizes."

Highlights

• Scalable method for enumerating maximal cliques in a graph using MapReduce.
• Effective solution to load balancing.
• Experimental evaluation of our solution on large real world graphs.
• Outperforms previous MapReduce solutions by orders of magnitude.
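How a total ordering of vertices can remove the need for a de-duplication pass can be sketched in a few lines. This is not the PECO implementation and does not use Hadoop: the function name, the choice of ordering (degree, then vertex id), and the in-memory adjacency-set input are assumptions for illustration. Each vertex forms an independent subproblem that reports only the maximal cliques in which it is the lowest-ranked member, so every maximal clique is emitted exactly once.

```python
def maximal_cliques_by_ordering(adj):
    """Return {vertex: [cliques]} where each maximal clique is listed once,
    under its lowest-ranked vertex.

    adj maps each vertex to the set of its neighbours (undirected graph).
    """
    # Assumed total order: by degree, ties broken by vertex id.
    # Any total order gives correct output; the choice affects load balance.
    ordered = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(ordered)}

    def expand(clique, cand, excl, out):
        # Basic Bron-Kerbosch recursion (no pivoting, for brevity).
        if not cand and not excl:
            out.append(sorted(clique))
            return
        for u in sorted(cand, key=rank.get):
            expand(clique | {u}, cand & adj[u], excl & adj[u], out)
            cand = cand - {u}
            excl = excl | {u}

    results = {}
    for v in adj:  # each iteration could run as a separate parallel task
        higher = {u for u in adj[v] if rank[u] > rank[v]}
        lower = adj[v] - higher
        out = []
        # Lower-ranked neighbours go into the excluded set: any clique they
        # could extend is either not maximal or belongs to their subproblem.
        expand({v}, higher, lower, out)
        results[v] = out
    return results


if __name__ == "__main__":
    # Triangle 1-2-3 with a pendant edge 3-4.
    graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(maximal_cliques_by_ordering(graph))
```

Seeding the excluded set with lower-ranked neighbours is what lets each subproblem reject non-maximal and already-owned cliques locally, instead of filtering them afterwards.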

