Abstract. Web archives, query and proxy logs, and so on, can all be very large and highly repetitive; and they are accessed only sporadically and partially, rather than continually and holistically. This type of data is ideal for compression-based archiving, provided that random access to small fragments of the original data can be achieved without needing to decompress everything. The recent RLZ (relative Lempel-Ziv) compression approach uses a semi-static model extracted from the text to be compressed, together with a greedy factorization of the whole text encoded using static integer codes. Here we demonstrate more precisely than before the scenarios in which RLZ excels. We contrast RLZ with alternatives based on block-based adaptive methods, including approaches that "prime" the encoding for each block, and measure a range of implementation options on both hard-disk drives (HDD) and solid-state drives (SSD). For HDD, the dominant factor affecting access speed is the compression rate achieved, even when this involves larger dictionaries and larger blocks. When the data is on SSD the same effects are present, but less markedly, and more complex trade-offs apply.
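To make the factorization concrete, the sketch below shows a greedy relative Lempel-Ziv pass in the spirit described above: the text is encoded as (offset, length) references into a dictionary, with literals where nothing matches. The naive longest-match search and the sample strings are illustrative assumptions only; practical RLZ implementations use suffix-array or index-based matching and encode the factors with static integer codes.

```python
# Minimal sketch of greedy relative Lempel-Ziv (RLZ) factorization.
# The dictionary string, sample text, and the naive O(n*m) longest-match
# search are illustrative; real implementations use a suffix array or
# FM-index over the dictionary and static integer codes for the factors.

def rlz_factorize(text: str, dictionary: str):
    """Greedily factor `text` into (offset, length) references into
    `dictionary`, emitting literal characters when no match exists."""
    factors = []
    i = 0
    while i < len(text):
        best_off, best_len = -1, 0
        for off in range(len(dictionary)):
            length = 0
            while (off + length < len(dictionary)
                   and i + length < len(text)
                   and dictionary[off + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        if best_len == 0:
            factors.append(('literal', text[i]))
            i += 1
        else:
            factors.append((best_off, best_len))
            i += best_len
    return factors


if __name__ == '__main__':
    dictionary = "GET /index.html HTTP/1.1 200\n"
    text = "GET /index.html HTTP/1.1 404\nGET /about.html HTTP/1.1 200\n"
    print(rlz_factorize(text, dictionary))
```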
The explosion of data growth and the duplication of data in enterprises have led to the deployment of a variety of deduplication technologies. However, not all deduplication technologies serve the needs of every workload. Most prior research in deduplication concentrates on fixed-block-size (or variable-block-size at a fixed block boundary) deduplication, which provides sub-optimal space efficiency in workloads where the duplicate data is not block aligned. Workloads also differ in the nature of their operations and their priorities, thereby affecting the choice of the right flavor of deduplication. Object workloads, for instance, hold multiple versions of archived documents that have a high degree of duplicate data. They are also write-once read-many in nature, follow a whole-object GET, PUT, and DELETE model, and would be better served by a deduplication strategy that handles non-block-aligned changes to data. In this paper, we describe and evaluate a hybrid of variable-length and block-based deduplication that is hierarchical in nature. We are motivated by the following insights from real-world data: (a) object workload applications do not modify data in place, and hence new versions of objects are written again as a whole; (b) a significant amount of data among different versions of the same object is shareable, but the changes are usually not block aligned. While the second point is the basis for the variable-length technique, both of the above insights motivate our hierarchical deduplication strategy. We show through experiments with production datasets from enterprise environments that this provides up to twice the space savings of fixed-block deduplication.
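As a sketch of the variable-length (content-defined) chunking that copes with non-block-aligned changes, the toy code below places chunk boundaries wherever a simple shift-and-add hash matches a bit pattern, so a small insertion shifts only the nearby boundaries and every later chunk still deduplicates. The hash, mask, and minimum chunk size are illustrative stand-ins for a production fingerprinting scheme such as Rabin fingerprints, and are not taken from the paper.

```python
# Illustrative content-defined chunking: boundaries depend on local content,
# not on fixed offsets, so chunks realign after a non-block-aligned edit.
import hashlib
import random

MIN_CHUNK = 64        # never cut a chunk smaller than this many bytes
MASK = (1 << 11) - 1  # boundary fires roughly once per 2 KiB on random data

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Hash kept to 32 bits; bytes more than 32 positions back shift out,
        # so boundary decisions depend only on nearby content.
        h = ((h << 1) + b) & 0xFFFFFFFF
        if i - start + 1 >= MIN_CHUNK and (h & MASK) == MASK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)

def chunk_fingerprints(data: bytes):
    """Fingerprint every chunk; equal fingerprints indicate shareable data."""
    return {hashlib.sha256(data[s:e]).digest() for s, e in chunk_boundaries(data)}

if __name__ == '__main__':
    random.seed(1)
    v1 = bytes(random.randrange(256) for _ in range(20000))
    v2 = b'XYZ' + v1  # a 3-byte insertion: a non-block-aligned change
    shared = chunk_fingerprints(v1) & chunk_fingerprints(v2)
    print(len(shared), 'of', len(chunk_fingerprints(v1)), 'chunks shared')
```

With fixed-size blocks, the same 3-byte insertion would shift every block boundary and leave almost nothing to share, which is the behavior the hybrid scheme is designed to avoid.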
System log files contain messages emitted from several modules within a system and carry valuable information about the system state, such as device status and error conditions, as well as about the various tasks within the system, such as program names, execution paths (including function names and parameters), and task completion status. For customers with remote support, the system collects and transmits these logs to a central enterprise repository, where they are monitored for alerts, problem forecasting, and troubleshooting. Very large log files limit their interpretability for the support engineers. For an expert, a large volume of log messages may not pose any problem; however, an inexperienced person may get flummoxed by the presence of a large number of log messages. Often it is desirable to present the log messages in a comprehensible manner where a person can view the important messages first and then go into details if required. In this article, we present a user-friendly log viewer where we first hide the unimportant or inconsequential messages from the log file. A user can then click a particular hidden view and get the details of the hidden messages. Messages with low utility are considered inconsequential, as their removal does not impact the end user for the aforesaid purposes such as problem forecasting or troubleshooting. We relate the utility of a message to the probability of its appearance in the given context. We present machine-learning-based techniques that compute the usefulness of individual messages in a log file. We demonstrate the identification and discarding of inconsequential messages to shrink the log size to acceptable limits. We have tested this over real-world logs and observed that eliminating such low-value data can reduce the log files significantly (30% to 55%), with minimal error rates (7% to 20%). When limited user feedback is available, we show modifications to the technique that learn the user's intent and accordingly further reduce the error.
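One way to read "utility tied to the probability of appearance in context" is as a surprisal score over message templates: templates that appear constantly in historical logs carry little information and can be hidden behind an expandable view, while rare templates are surfaced. The unigram template model, Laplace smoothing, and threshold below are illustrative assumptions, not the paper's actual machine-learning models.

```python
# Illustrative surprisal-based filter: hide highly predictable log messages.
import math
import re
from collections import Counter

def template(line: str) -> str:
    """Collapse numbers and hex tokens so recurring messages share one key."""
    return re.sub(r'0x[0-9a-fA-F]+|\d+', '<*>', line).strip()

def message_surprisal(history_lines, new_lines):
    """Score each new line by -log2 P(template), estimated from history."""
    counts = Counter(template(l) for l in history_lines)
    total = sum(counts.values())
    scores = []
    for line in new_lines:
        # Laplace-style smoothing so unseen templates get a small, nonzero P.
        p = (counts[template(line)] + 1) / (total + len(counts) + 1)
        scores.append((-math.log2(p), line))
    return scores

def condensed_view(history_lines, new_lines, hide_below_bits=6.0):
    """Partition into (shown, hidden); hidden lines sit behind a click-to-expand view."""
    shown, hidden = [], []
    for bits, line in message_surprisal(history_lines, new_lines):
        (shown if bits >= hide_below_bits else hidden).append(line)
    return shown, hidden

if __name__ == '__main__':
    history = ['heartbeat ok seq=%d' % i for i in range(1000)] + ['disk error on /dev/sda']
    new = ['heartbeat ok seq=1001', 'disk error on /dev/sdb', 'fan speed critical']
    shown, hidden = condensed_view(history, new)
    print('shown: ', shown)   # rare, high-surprisal messages
    print('hidden:', hidden)  # routine heartbeat lines
```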
Space management is the activity of monitoring and ensuring adequate free space on all volumes in a clustered storage system. Volumes that exceed used-space limits are typically relieved by migrating a part of their data to other under-utilized volumes. Without deduplication, space reclamation is simple, as one just has to migrate as much data as the desired amount of space to be reclaimed. However, in deduplicated volumes there is no direct relation between the logical size of a file and the physical space it occupies. Therefore, optimal space reclamation is hard because: (a) migrating a few files may produce little or zero bytes of free space, yet still incur significant network costs; (b) migrating a heavily shared file destroys the disk-sharing relationships in that volume and increases the physical space consumption of that dataset. In this work, we have designed and built a fast and efficient tool, Rangoli, that identifies the optimal set of files for space reclamation in a deduplicated environment. It can scale to millions of files and terabytes of data, running in tens of minutes. We show by experimenting on real-world datasets that alternative strategies, such as those based on finding unique files or on MinHash, impact physical space consumption by a wide margin (up to 35 times) compared to Rangoli.
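The difficulty the abstract describes, that a file's logical size says little about the space its migration frees, can be seen with a small model of block sharing: the bytes reclaimed by migrating a set of files are only those of blocks that no remaining file still references. The block size and toy volume below are illustrative assumptions; Rangoli solves this selection problem at scale rather than by enumerating candidate sets.

```python
# Illustrative model of space reclamation on a deduplicated volume:
# migrating files frees only the blocks no remaining file references.

BLOCK = 4096  # bytes per deduplicated block (illustrative)

def reclaimed_bytes(volume, candidate_files):
    """Bytes freed on `volume` (dict: file -> set of block fingerprints)
    if every file in `candidate_files` is migrated elsewhere."""
    remaining = set()
    for f, blocks in volume.items():
        if f not in candidate_files:
            remaining |= blocks
    freed = set()
    for f in candidate_files:
        freed |= volume[f]
    return len(freed - remaining) * BLOCK

if __name__ == '__main__':
    volume = {
        'a.doc':    {1, 2, 3, 4},
        'a_v2.doc': {1, 2, 3, 5},   # shares most blocks with a.doc
        'b.iso':    {6, 7, 8, 9},   # unshared
    }
    # Migrating only a_v2.doc frees a single unshared block despite its
    # 16 KiB logical size...
    print(reclaimed_bytes(volume, {'a_v2.doc'}))  # 4096
    # ...while migrating b.iso, of the same logical size, frees all four blocks.
    print(reclaimed_bytes(volume, {'b.iso'}))     # 16384
```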