Principles of dataset versioning

Bhattacherjee, Souvik; Chavan, Amit; Huang, Silu; Deshpande, Amol; Parameswaran, Aditya

doi:10.14778/2824032.2824035

Cited by 59 publications

(16 citation statements)

References 20 publications

“…Our goals in the experimental evaluation are to answer the following key questions: (1) What is the speedup in execution time from…” (footnotes captured with the excerpt: 6, https://www.kaggle.com/c/zillow-prize-1; 7, https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py)

Section: Results · mentioning

confidence: 99%

“…In addition, computing intermediates by re-running the model for each analytic query not only slows down the process of model diagnosis but can also be unacceptable for interactive query workloads. Thus, the bottleneck in supporting efficient and widely usable model diagnosis is caused by two data management questions: (a) how do we store large amounts of data efficiently for storage and querying (e.g., as in [6,37,48]); and (b) how do we trade-off intermediate storage vs. recreation (as in [7,22,52])?…”

Section: Mistique: Storing Model Intermediates · mentioning

confidence: 99%

“…CIFAR10 contains 50K training images from 10 classes where each image has dimensions 64x64x3. We evaluate on two models trained on CIFAR10: the VGG16 model fine-tuned on CIFAR10, denoted as CIFAR10_VGG16 (the original model has been trained on the IMAGENET [41] dataset) and a well-accepted, simple CNN model trained from scratch, denoted as CIFAR10_CNN (footnote 7). The original VGG16 model consists of 13 convolutional layers and 3 fully connected layers.…”

Section: DNN Models (DNN) · mentioning

confidence: 99%

“…On the side of array databases, [42] tackled the question of storing multiple versions of array data by taking advantage of delta encoding and compression. Subsequent work [7] in a similar vein, but for relational data, addressed the question of storing vs. re-creating dataset versions. While the techniques proposed in this work are powerful, they have limited applicability in our setting because our intermediates are not versions of the same dataset and the complete set of versions is not known apriori.…”

Section: Related Work · mentioning

confidence: 99%

Mistique

Vartak, Trindade, Madden, et al. 2018

Proceedings of the 2018 International Conference on Management of Data

Model diagnosis is the process of analyzing machine learning (ML) model performance to identify where the model works well and where it doesn't. It is a key part of the modeling process and helps ML developers iteratively improve model accuracy. Often, model diagnosis is performed by analyzing different datasets or intermediates associated with the model such as the input data and hidden representations learned by the model (e.g., [4, 24, 39]). The bottleneck in fast model diagnosis is the creation and storage of model intermediates. Storing these intermediates requires tens to hundreds of GB of storage whereas re-running the model for each diagnostic query slows down model diagnosis. To address this bottleneck, we propose a system called MISTIQUE that can work with traditional ML pipelines as well as deep neural networks to efficiently capture, store, and query model intermediates for diagnosis. For each diagnostic query, MISTIQUE intelligently chooses whether to rerun the model or read a previously stored intermediate. For intermediates that are stored in MISTIQUE, we propose a range of optimizations to reduce storage footprint including quantization, summarization, and data de-duplication. We evaluate our techniques on a range of real-world ML models in scikit-learn and Tensorflow. We demonstrate that our optimizations reduce storage by up to 110X for traditional ML pipelines and up to 6X for deep neural networks. Furthermore, by using MISTIQUE, we can speed up diagnostic queries on traditional ML pipelines by up to 390X and 210X on deep neural networks.
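The rerun-or-read choice described in the abstract can be illustrated with a minimal sketch. This is not MISTIQUE's actual cost model: the `Intermediate` record, the `choose_plan` function, and the simple "read time versus rerun time" comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Intermediate:
    """Hypothetical record describing one model intermediate."""
    name: str
    stored_bytes: int     # size on disk; 0 means it was never stored
    rerun_seconds: float  # estimated time to recompute it from the model


def choose_plan(inter: Intermediate, read_bytes_per_second: float) -> str:
    """Return "read" or "rerun" for a diagnostic query on one intermediate.

    Assumed rule: read the stored copy only when the estimated read time
    does not exceed the estimated time to rerun the model.
    """
    if inter.stored_bytes <= 0:
        return "rerun"  # nothing stored, so recomputation is the only option
    read_seconds = inter.stored_bytes / read_bytes_per_second
    return "read" if read_seconds <= inter.rerun_seconds else "rerun"


if __name__ == "__main__":
    layer = Intermediate("layer5_activations", 200_000_000, 12.0)
    print(choose_plan(layer, read_bytes_per_second=100_000_000))
```

In this toy rule, storage optimizations such as quantization would matter only through `stored_bytes`: a smaller stored copy lowers the estimated read time and tips the choice toward reading.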

“…Data Versioning: There has been some work [4,28] on selectively storing and deleting dataset versions. Although we are not dealing with dataset versions, it is useful to examine this related line of work.…”

Section: Related Work · mentioning

confidence: 99%

R2D2: Reducing Redundancy and Duplication in Data Lakes

Shah, Mukherjee, Tyagi, et al. 2023

Proc. ACM Manag. Data

Enterprise data lakes often suffer from substantial amounts of duplicate and redundant data, with data volumes ranging from terabytes to petabytes. This leads to both increased storage costs and unnecessarily high maintenance costs for these datasets. In this work, we focus on identifying and reducing redundancy in enterprise data lakes by addressing the problem of "dataset containment". To the best of our knowledge, this is one of the first works that addresses table-level containment at a large scale. We propose R2D2: a three-step hierarchical pipeline that efficiently identifies almost all instances of containment by progressively reducing the search space in the data lake. It first builds (i) a schema containment graph, followed by (ii) statistical min-max pruning, and finally, (iii) content level pruning. We further propose minimizing the total storage and access costs by optimally identifying redundant datasets that can be deleted (and reconstructed on demand) while respecting latency constraints. We implement our system on Azure Databricks clusters using Apache Spark for enterprise data stored in ADLS Gen2, and on AWS clusters for open-source data. In contrast to existing modified baselines that are inaccurate or take several days to run, our pipeline can process an enterprise customer data lake at the TB scale in approximately 5 hours with high accuracy. We present theoretical results as well as extensive empirical validation on both enterprise (scale of TBs) and open-source datasets (scale of MBs - GBs), which showcase the effectiveness of our pipeline.
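The three pruning stages named in the abstract (schema containment, min-max pruning, content-level checking) can be sketched on small in-memory tables. This is only an illustration under stated assumptions, not the R2D2 implementation: the `Table` class and all function names are hypothetical, the pairwise loop stands in for the paper's schema containment graph, and the exact row-set comparison stands in for its content-level pruning.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Table:
    """Hypothetical in-memory table: column name -> list of comparable values."""
    name: str
    columns: Dict[str, List]


def schema_contained(child: Table, parent: Table) -> bool:
    # Stage (i): every column of the child must exist in the parent.
    return set(child.columns) <= set(parent.columns)


def min_max_compatible(child: Table, parent: Table) -> bool:
    # Stage (ii): each child column's value range must fall inside the parent's.
    for col, values in child.columns.items():
        if not values:
            continue
        parent_values = parent.columns[col]
        if not parent_values:
            return False
        if min(values) < min(parent_values) or max(values) > max(parent_values):
            return False
    return True


def content_contained(child: Table, parent: Table) -> bool:
    # Stage (iii): exact check that every child row appears in the parent,
    # projected onto the child's columns.
    cols = sorted(child.columns)
    parent_rows = set(zip(*(parent.columns[c] for c in cols)))
    child_rows = set(zip(*(child.columns[c] for c in cols)))
    return child_rows <= parent_rows


def find_containment(tables: List[Table]) -> List[Tuple[str, str]]:
    """Return (child, parent) name pairs; cheap checks run before costly ones."""
    pairs = []
    for child in tables:
        for parent in tables:
            if child is parent:
                continue
            if (schema_contained(child, parent)
                    and min_max_compatible(child, parent)
                    and content_contained(child, parent)):
                pairs.append((child.name, parent.name))
    return pairs
```

The ordering is the point of the sketch: the schema and min-max tests read only metadata or per-column statistics, so most candidate pairs are discarded before the row-level comparison runs.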
