Helping scientists reconnect their datasets

Alawini, Abdussalam; Maier, David; Tufte, Kristin; Howe, Bill

doi:10.1145/2618243.2618263

Cited by 12 publications

(9 citation statements)

References 18 publications

“…Helping data scientist to match and explore heterogeneous datasets, even when their scheme is unknown or unfamiliar, is an active and interesting area of research with multiple ramifications [10,14], one of which is schema matching [1]. To the best of our knowledge, there has been no detailed discussions on how this can be achieved on multidimensional spaces when uncertainty is unavoidable.…”

Section: Discussion (mentioning)

confidence: 99%

Inference of Common Multidimensional Equally-Distributed Attributes

Ayllón, Palomo-Duarte, Dodero

2021

Preprint

Given two relations containing multiple measurements, possibly with uncertainties, our objective is to find which sets of attributes from the first have a corresponding set in the second, using exclusively a sample of the data. This approach could be used even when the associated metadata is damaged, missing or incomplete, or when the volume is too big for exact methods. This problem is similar to the search for Inclusion Dependencies (IND), a type of rule over two relations asserting that for a set of attributes X from the first, every combination of values appears in a set Y from the second. Existing INDs can be found by exploiting a partial order relation called specialization. However, this relation is based on set theory, requiring the values to be directly comparable. Statistical tests are an intuitive possible replacement, but it has not been studied how they would affect the underlying assumptions. In this paper we formally review the effect that a statistical approach has on the inference rules applied to IND discovery. Our results confirm the intuition that statistical tests can be used, but not in a directly equivalent manner. We provide a workable alternative based on a "hierarchy of null hypotheses", allowing for the automatic discovery of multi-dimensional equally distributed sets of attributes.
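The inclusion dependency defined in this abstract can be illustrated with a minimal sketch. It shows only the exact, set-based check, not the statistical approach the paper proposes. The function name `holds_ind`, the attribute names, and the sample relations are all hypothetical.

```python
from typing import Dict, Sequence

Row = Dict[str, object]  # a relation is modelled as a sequence of attribute -> value rows


def holds_ind(r: Sequence[Row], x: Sequence[str],
              s: Sequence[Row], y: Sequence[str]) -> bool:
    """Return True if every combination of values of attributes x in r
    also appears as a combination of values of attributes y in s."""
    if len(x) != len(y):
        raise ValueError("x and y must list the same number of attributes")
    left = {tuple(row[a] for a in x) for row in r}
    right = {tuple(row[a] for a in y) for row in s}
    return left <= right  # set inclusion: exact, value-by-value comparison


# Hypothetical sample relations (assumed data, for illustration only)
r = [{"ra": 1, "dec": 10}, {"ra": 2, "dec": 20}]
s = [{"alpha": 1, "delta": 10}, {"alpha": 2, "delta": 20}, {"alpha": 3, "delta": 30}]

print(holds_ind(r, ["ra", "dec"], s, ["alpha", "delta"]))  # True
print(holds_ind(s, ["alpha"], r, ["ra"]))                  # False
```

Because the check relies on exact equality of values, it does not carry over directly to measurements with uncertainties, which is the limitation the abstract points out.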

“…In [13], Pochampally et al propose to model correlations between different data sources using joint precision (portion of correct outputs over entire outputs) and joint recall (portion of all correct triples that are output by all sources) as indicators. In comparison, the work in [14] relies on history and schema of data sets to map and link them together. In [15], Roy et al use the concept of intervention (i.e, changes in the values of inputs affect the outputs) to look for causal explanation for the answers of SQL queries.…”

Section: Related Work (mentioning)

confidence: 99%

AMIC: An Adaptive Information Theoretic Method to Identify Multi-Scale Temporal Correlations in Big Time Series Data

et al. 2021

IEEE Trans. Big Data

Recent developments in computing, sensing and crowd-sourced data have resulted in an explosion in the availability of quantitative information. The possibilities of analyzing this so-called Big Data to inform research and the decision-making process are virtually endless. In general, analyses have to be done across multiple data sets in order to extract the most value from Big Data. A first important step is to identify temporal correlations between data sets. Given the characteristics of Big Data in terms of volume and velocity, techniques that identify correlations not only need to be fast and scalable, but also need to help users in ordering the correlations across temporal scales so that they can focus on important relationships. In this paper, we present AMIC (Adaptive Mutual Information-based Correlation), a method based on mutual information to identify correlations at multiple temporal scales in large time series. Discovered correlations are suggested to users in an order based on the strength of the relationships. Our method supports an adaptive streaming technique that minimizes duplicated computation and is implemented on top of Apache Spark for scalability. We also provide a comprehensive evaluation on the effectiveness and the scalability of AMIC using both synthetic and real-world data sets.
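As a rough illustration of the quantity this abstract builds on, the sketch below estimates mutual information between two discrete series at a fixed lag. It is not AMIC: it has no adaptive multi-scale search, no streaming, and no Spark. The function names and the sample series are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Empirical mutual information, in bits, between two equally long
    sequences of discrete symbols."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(x, y) * log2( p(x, y) / (p(x) * p(y)) ) over observed pairs.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def lagged_mi(xs: Sequence[int], ys: Sequence[int], lag: int) -> float:
    """Mutual information between xs[t] and ys[t + lag], for lag >= 0."""
    if len(xs) != len(ys):
        raise ValueError("sequences must be of equal length")
    if lag < 0 or lag >= len(xs):
        raise ValueError("lag must satisfy 0 <= lag < len(xs)")
    return mutual_information(xs[: len(xs) - lag], ys[lag:])


# Hypothetical series: ys repeats xs one step later.
xs = [0, 1, 0, 1, 1, 0, 0, 1]
ys = [0] + xs[:-1]

# Rank candidate lags by the strength of the relationship, strongest first.
ranked = sorted(range(4), key=lambda k: lagged_mi(xs, ys, k), reverse=True)
print(ranked[0])  # 1
```

Ranking lags by their score mirrors, in a very simplified way, the idea of presenting correlations to users in order of strength.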

“…It is often desirable to reconstruct a human-interpretable lineage for such various versions. As demonstrated in a user study from prior work [111], detecting the relationship among datasets can enable users to recall transformations from one dataset version to another, and subsequently help users identify the best dataset for a given task. As revealed in Example 8.1, a real workflow written by some data scientist, feature engineering and data quality play a critical role in the performance of a machine learning task.…”

Section: Additional Related Work (mentioning)

confidence: 99%

“…ReConnect [111] attempts to discover the relationship for a given dataset pair. It first defines a space of relevant relationships, generates the conditions for each relationship based on row and column statistics, and then suggests a relationship for a given dataset pair by examining the conditions.…”

Section: Additional Related Work (mentioning)

confidence: 99%
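The approach described in the statement above (define a space of relationships, derive conditions from row and column statistics, then suggest a relationship for a dataset pair) can be sketched roughly as follows. This is not ReConnect's actual relationship space or algorithm; the relationship labels, the function names, and the sample tables are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

Table = List[Dict[str, object]]  # a dataset as a list of column -> value rows


def column_names(t: Table) -> FrozenSet[str]:
    """Column statistic: the set of column names (empty for an empty table)."""
    return frozenset(t[0].keys()) if t else frozenset()


def row_set(t: Table, cols: Sequence[str]) -> Set[Tuple[object, ...]]:
    """Row statistic: the set of distinct rows projected onto cols."""
    return {tuple(row[c] for c in cols) for row in t}


def suggest_relationship(a: Table, b: Table) -> str:
    """Suggest how dataset a relates to dataset b by testing simple conditions."""
    cols_a, cols_b = column_names(a), column_names(b)
    shared = sorted(cols_a & cols_b)
    if cols_a == cols_b:
        rows_a, rows_b = row_set(a, shared), row_set(b, shared)
        if rows_a == rows_b:
            return "duplicate"
        if rows_a < rows_b:
            return "row-subset"      # a holds some of b's rows
    elif cols_a < cols_b and len(a) == len(b):
        if row_set(a, shared) == row_set(b, shared):
            return "column-subset"   # a holds some of b's columns
    return "unknown"


# Hypothetical sample tables (assumed data, for illustration only)
full = [{"id": 1, "temp": 20.5}, {"id": 2, "temp": 21.0}, {"id": 3, "temp": 19.8}]
subset = full[:2]
narrow = [{"id": row["id"]} for row in full]

print(suggest_relationship(subset, full))  # row-subset
print(suggest_relationship(narrow, full))  # column-subset
```

The sketch only tests one direction (a relative to b); a caller interested in both directions would call it twice with the arguments swapped.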

Effective Data Versioning for Collaborative Data Analytics

Huang

2020

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

With the massive proliferation of datasets in a variety of sectors, data science teams in these sectors spend vast amounts of time collaboratively constructing, curating, and analyzing these datasets. Versions of datasets are routinely generated during this data science process, via various data processing operations like data transformation and cleaning, feature engineering and normalization, among others. However, no existing systems enable us to effectively store, track, and query these versioned datasets, leading to massive redundancy in versioned data storage and making true collaboration and sharing impossible. In this thesis, we develop solutions for versioned data management for collaborative data analytics. In the first part of this thesis, we extend a relational database to support versioning of structured data. Specifically, we build a system, OrpheusDB, on top of a relational database with a carefully designed data representation and an intelligent partitioning algorithm for fast version control operations. OrpheusDB inherits much of the same benefits of relational databases, while also compactly storing, keeping track of, and recreating versions on demand. However, OrpheusDB implicitly makes a few assumptions, namely that: (a) the SQL assumption: a SQL-like language is the best fit for querying data and versioning information; (b) the structural assumption: the data is in a relational format with a regular structure; (c) the from-scratch assumption: users adopt OrpheusDB from the very beginning of their project and register each data version along with full metadata in the system. In the second part of this thesis, we remove each of these assumptions, one at a time. First, we remove the SQL assumption and propose a generalized query language for querying data along with versioning and provenance information. Second, we remove the structural assumption and develop solutions for compact storage and fast retrieval of arbitrary data representations. 
Finally, we remove the "from-scratch" assumption, by developing techniques to infer lineage relationships among versions residing in an existing data repository.
