Collaborative Science Workflows in SQL

Howe, Bill; Halperin, Daniel; Ribalet, François; Chitnis, Sanjay; Armbrust, E. Virginia

doi:10.1109/mcse.2013.42

Cited by 8 publications

(6 citation statements)

References 5 publications


“…Based on our experience with SQLShare [3], we believe that science users can write data analysis tasks in SQL. We expect Datalog's declarative style to have similar appeal, especially for recursive queries.…”

Section: Supported Query Languages (mentioning)

confidence: 99%

“…Myria strikes a balance between these extremes: we adopt a core programming model that extends relational algebra with iteration that affords rich, iteration-aware optimization without sacrificing expressive power. Guided by prior experience in delivering database-as-a-service capabilities to scientists [3], we aim to support both "users" and "algorithm designers" with a common set of web-based interfaces, languages, and APIs that scale gracefully from simple SPJ queries to advanced application-specific analytics tasks. Like Hyracks, we emphasize the use of core parallel query processing concepts as a first-class concern, but we place less emphasis on supporting legacy code written for Hadoop or Pregel and more emphasis on empowering non-specialists, especially scientists.…”

Section: Introduction (mentioning)

confidence: 99%


Demonstration of the Myria big data management service

Halperin, Almeida, Choo, et al. 2014

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Self Cite


In this demonstration, we will showcase Myria, our novel cloud service for big data management and analytics designed to improve productivity. Myria's goal is for users to simply upload their data and for the system to help them be self-sufficient data science experts on their data: self-serve analytics. From a web browser, Myria users can upload data, author efficient queries to process and explore the data, and debug correctness and performance issues. Myria queries are executed on a scalable, parallel cluster that uses both state-of-the-art and novel methods for distributed query processing. Our interactive demonstration will guide visitors through an exploration of several key Myria features by interfacing with the live system to analyze big datasets over the web.


“…All three languages are compiled to the same intermediate representation based on an extension of RA+While, then optimized to produce a parallel physical plan for execution on a cluster. Based on our experience with SQLShare [9] (described below), we know that science users can and will write data analysis tasks in declarative languages, but we seek new language features to capture a greater proportion of their tasks. Myria's execution layer, MyriaX, adopts state-of-the-art system design principles: it uses a pipelined, possibly cyclic graph of dataflow operators that make efficient use of I/O and memory, and it has built-in support for asynchronous evaluation of recursive queries.…”

Section: Big Data Systems (mentioning)

confidence: 99%

“…The SQLShare experiment has been remarkably successful in demonstrating the utility of databases in new contexts; it currently has hundreds of science users who have uploaded several thousand datasets of varying size and complexity and issued tens of thousands of hand-written SQL queries. We have seen collections of scripts written in R and Python replaced with a handful of SQL queries, simplifying collaborative analysis to the exchange of links into SQLShare [9]. We have seen SQLShare used to facilitate open data and complexity hiding: at least one public dataset is a view that joins 50 distinct tables.…”

Section: User-facing Tools (mentioning)

confidence: 99%

The database group at the University of Washington

2014

Self Cite


The database group at the University of Washington (UW) was founded in 1998 when the department hired Alon Halevy (now at Google). The group currently consists of about twenty researchers: three faculty members (the authors), four postdocs, and fifteen students. Alumni include faculty members at Computer Science Departments at British Columbia, Michigan, Pennsylvania, Stanford, UMass, Wisconsin, one faculty member at the CMU Tepper School of Business, and several researchers and engineers at Facebook, Google, Microsoft, Nokia, Twitter, and other technology companies. The group has funding from NSF, the Gordon and Betty Moore Foundation, the Alfred P. Sloan Foundation, and several companies including Amazon, EMC, Google, HP, Intel, Microsoft, NEC, and Yahoo. The group has been recognized through several best paper awards and two ACM SIGMOD Best Dissertation Awards.

We conduct research mostly in small groups and tackle a diverse set of data management challenges. Some of our projects result from collaborations with domain scientists on the UW campus; others are sparked by novel theoretical breakthroughs that lead to new approaches to data management challenges; many are the results of both. We give here a short overview of the recent research themes in our group; more details are available on our website: http://db.cs.washington.edu/

SCIENTIFIC DATA MANAGEMENT

Our research agenda is partially derived from collaborations with scientists across the University of Washington and beyond, leveraging our close connection with the University of Washington eScience Institute [6].

The eScience Institute was founded in 2005 with the goal of advancing the research and practice of data-intensive discovery across all fields of science. With the advent of new, high-bandwidth data sources (survey telescopes, high-throughput sequencers, ubiquitous sensor networks, planetary-scale simulations), data management research became recognized as a critical driver of scientific discovery. As a result, the database group and the eScience Institute became close partners, and were able to initiate and maintain multiple long-term collaborations with scientists.

In 2008, we founded an inter-disciplinary research group called AstroDB [1]. This group brings together faculty, research scientists, postdocs, and students from the Astronomy department and our database group. In 2009, we initiated an independent collaboration with a marine microbiology lab. Thanks to the sustained nature of these partnerships, both have led to a series of joint research projects. We give examples in the following sections.

Our inter-disciplinary collaborations have also allowed us to collect a curated repository of datasets and use cases that anyone can use in their research: A repository of MapReduce applications [15], a public repository of scientific datasets equipped with a SQL interface [19], and a number of parallel analytics use cases that go beyond MapReduce [14]. We are continuously working on expanding these collections of applications.

BIG DATA SYSTEMS

Motivated b...


“…Federated databases provide the ability to give users the feel of a data warehouse without physically moving data into a central repository [9]. As an example of a federated database, consider Myria [10,11], a distributed database that uses SQL or MyriaL as the language all of which was developed at the University of Washington. One of the challenges in database federation has been in developing a programming API that can be used to interact with the ever-increasing variety of databases and storage engines [12].…”

Section: Introduction (mentioning)

confidence: 99%

D4M: Bringing associative arrays to database engines

Gadepally, Kepner, Arcand, et al. 2015

2015 IEEE High Performance Extreme Computing Conference (HPEC)


Abstract: The ability to collect and analyze large amounts of data is a growing problem within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. Numerous tools exist that allow users to store, query and index these massive quantities of data. Each storage or database engine comes with the promise of dealing with complex data. Scientists and engineers who wish to use these systems often quickly find that there is no single technology that offers a panacea to the complexity of information. When using multiple technologies, however, there is significant trouble in designing the movement of information between storage and database engines to support an end-to-end application along with a steep learning curve associated with learning the nuances of each underlying technology. In this article, we present the Dynamic Distributed Dimensional Data Model (D4M) as a potential tool to unify database and storage engine operations. Previous articles on D4M have showcased the ability of D4M to interact with the popular NoSQL Accumulo database. Recently however, D4M now operates on a variety of backend storage or database engines while providing a federated look to the end user through the use of associative arrays. In order to showcase how new databases may be supported by D4M, we describe the process of building the D4M-SciDB connector and present performance of this connection.
