The scope of archival systems is expanding beyond cheap tertiary storage: scientific and medical data is increasingly digital, and the public has a growing desire to digitally record their personal histories. Driven by the increased cost efficiency of hard drives compared to tape, and the rise of the Internet, content archives have become a means of providing the public with fast, cheap access to long-term data. Unfortunately, designers of purpose-built archival systems are either forced to rely on workload behavior obtained from a narrow, anachronistic view of archives as simply cheap tertiary storage, or extrapolate from marginally related enterprise workload data and traditional library access patterns. To provide relevant input for the design of effective long-term data storage systems, we examined the workload behavior of several scientific and historical archives, covering a mixture of purposes, media types, and access models. Our findings show that, for scientific archival storage, files have become larger, but update rates have remained largely unchanged. However, in public content archives, we observed behavior that diverges from the traditional "write-once, read-maybe" behavior of tertiary storage. Our study shows that the majority of such data is modified relatively frequently, and that indexing services such as Google and internal data management processes may routinely access large portions of an archive, accounting for most of the accesses. Based on these observations, we identify areas for improving the efficiency and performance of archival storage systems.
For three decades, Kryder's law correctly predicted an exponential increase in bit density on disk platters, leading to an exponential drop in cost per gigabyte, and thus to an entrenched expectation that if data could be stored for a few years the incremental cost of storing it forever would be minimal. However, disk is now over 7 times as expensive as Kryder's law would have predicted, and industry projections suggest that in 2020 the gap will reach 200 times, disrupting this expectation. Our model shows that archives based upon alternative media are surprisingly cost competitive with archives based upon traditional disk media over the long term. We propose using Archival Flash for long-term data preservation, accepting fewer write cycles in exchange for a longer data retention period.
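The expectation described above follows from a geometric series: if the yearly cost of keeping a fixed amount of data falls by a constant fraction each year, the total cost of keeping it forever is a small multiple of the first year's cost. The following is a minimal sketch of that arithmetic only, not the cost model from the paper. The function name `endowment_multiple` and the decline rates used below are illustrative assumptions.

```python
from typing import Optional


def endowment_multiple(annual_decline: float, years: Optional[int] = None) -> float:
    """Total storage cost as a multiple of the first-year cost.

    Assumes (hypothetically) that the cost of storing a fixed amount of
    data drops by `annual_decline` (a fraction in (0, 1)) every year.
    With years=None the sum runs forever: 1 + q + q^2 + ... = 1 / (1 - q),
    where q = 1 - annual_decline.
    """
    if not 0.0 < annual_decline < 1.0:
        raise ValueError("annual_decline must be strictly between 0 and 1")
    ratio = 1.0 - annual_decline
    if years is None:
        return 1.0 / annual_decline
    # Finite geometric sum over `years` terms.
    return (1.0 - ratio ** years) / annual_decline


if __name__ == "__main__":
    # Illustrative rates only; neither value is taken from the paper.
    for rate in (0.4, 0.1):
        print(f"decline {rate:.0%}/yr -> forever costs "
              f"{endowment_multiple(rate):.1f}x the first year")
```

Under these assumed rates, a fast decline keeps the "forever" cost close to the cost of the first few years, while a slow decline makes the long tail dominate, which is the disruption the abstract describes.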
While file system metadata is well characterized by a variety of workload studies, scientific metadata is much less well understood. We characterize scientific metadata, in order to better understand the implications for index design. Based on our findings, existing solutions for either file system or scientific search will not suffice for indexing a large scientific file system. We describe the problems with existing solutions, and suggest column stores as an alternative approach.
Growth in disk capacity continues to outpace advances in read speed and device reliability. This has led to storage systems spending increasing amounts of time in a degraded state while failed disks reconstruct. Users and applications that do not use the data on the failed or degraded drives are negligibly impacted by the failure, increasing the perceived performance of the system. We leverage this observation with PERSES, a statistical data allocation scheme to reduce the performance impact of reconstruction after disk failure. PERSES reduces degradation from the perspective of the user by clustering data on disks such that data with high probability of co-access is placed on the same device as often as possible. Trace-driven simulations show that, by laying out data with PERSES, we can reduce the perceived time lost due to failure over three years by up to 80% compared to arbitrary allocation.
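The idea of placing co-accessed data on the same device can be illustrated with a simple greedy grouping. This is a hedged sketch and not the PERSES algorithm itself: the abstract does not specify how co-access probability is estimated or how groups are formed, so raw co-occurrence counts within a session stand in for that estimate. The names `co_access_counts` and `cluster_blocks`, the session-based trace format, and the per-disk `capacity` (in blocks) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(sessions):
    """Count how often each pair of blocks appears in the same session."""
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts


def cluster_blocks(sessions, capacity):
    """Greedily group blocks that are often accessed together.

    Each returned group is meant to live on one disk and holds at most
    `capacity` blocks. Pairs are merged from most to least co-accessed.
    """
    counts = co_access_counts(sessions)
    # Start with every block in its own group.
    group_of = {}
    for session in sessions:
        for block in session:
            group_of.setdefault(block, {block})
    # Strongest pairs first; ties are broken by pair name for determinism.
    for (a, b), _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        ga, gb = group_of[a], group_of[b]
        if ga is not gb and len(ga) + len(gb) <= capacity:
            ga |= gb
            for block in gb:
                group_of[block] = ga
    # Collect each distinct group once, in first-seen order.
    groups = []
    for group in group_of.values():
        if not any(group is seen for seen in groups):
            groups.append(group)
    return [sorted(group) for group in groups]


if __name__ == "__main__":
    trace = [["a", "b"], ["a", "b"], ["c", "d"], ["b", "c"]]
    print(cluster_blocks(trace, capacity=2))
```

With a layout like this, a single disk failure would tend to affect only the sessions that use that disk's group, rather than touching a little of every session's working set.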