Scientific datasets are growing rapidly and becoming critical to next-generation scientific discoveries. The validity of scientific results relies on the quality of the data used, and data are often subject to change due to, for example, added observations, quality assessments, or processing software updates. The effects of data change are not well understood and are difficult to predict. Datasets are often updated repeatedly, and recomputing derived data products quickly becomes time consuming and resource intensive, and in some cases may not even be necessary, thus delaying scientific advances. Despite the importance of this problem, systematic approaches for comparing data versions and quantifying the changes are lacking, and ad hoc or manual processes are commonly used. In this article, we propose a novel hierarchical approach for analyzing data changes, including real-time (online) and offline analyses. We employ a variety of fast-to-compute numerical analyses, graphical data change representations, and more resource-intensive recomputations of a subset of the data product. We illustrate the application of our approach using three scientifically diverse use cases, namely, satellite, cosmological, and x-ray data. The results show that a variety of data change metrics should be employed to enable a comprehensive representation and qualitative evaluation of data changes.
KEYWORDS
data management, data versions, hierarchical data change analysis, QA/QC, scientific data change analysis
INTRODUCTION
Next-generation scientific discoveries increasingly rely on the processing of data from experiments, observations, and simulations. The validity of scientific results relies on the quality of the data. However, many scientific communities experience "data change." 1 Data are often published in the form of versions. A new version of a dataset may mean, for example, that new entries have been added to the dataset (time series or survey data), that new processing software has been used for quality assessment and control of the data, or that the settings of a measurement device were found to be incorrect and the previously published data taken from the device had to be corrected. 2
The need to take data changes into account can have huge implications for compute resources and researchers' time for reprocessing, for the storage required to manage multiple versions of the data, and for the science results obtained from using the data. For example, in the environmental sciences, satellite data from the Moderate Resolution Imaging Spectroradiometer (MODIS) are used to calculate evapotranspiration (ET). 3 ET is an