2015
DOI: 10.14778/2824032.2824035

Principles of dataset versioning

Abstract: The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or r…

Cited by 59 publications (16 citation statements)
References 20 publications
“…Our goals in the experimental evaluation are to answer the following key questions: (1) What is the speedup in execution time from 6 https://www.kaggle.com/c/zillow-prize-1 7 https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py…”
Section: Results (mentioning)
confidence: 99%
“…In addition, computing intermediates by re-running the model for each analytic query not only slows down the process of model diagnosis but can also be unacceptable for interactive query workloads. Thus, the bottleneck in supporting efficient and widely usable model diagnosis is caused by two data management questions: (a) how do we store large amounts of data efficiently for storage and querying (e.g., as in [6,37,48]); and (b) how do we trade-off intermediate storage vs. recreation (as in [7,22,52])?…”
Section: Mistique: Storing Model Intermediates (mentioning)
confidence: 99%
“…Data Versioning: There has been some work [4,28] on selectively storing and deleting dataset versions. Although we are not dealing with dataset versions, it is useful to examine this related line of work.…”
Section: Related Work (mentioning)
confidence: 99%