2011 First International Conference on Data Compression, Communications and Processing
DOI: 10.1109/ccp.2011.41
Quick Estimation of Data Compression and De-duplication for Large Storage Systems

Cited by 9 publications (5 citation statements). References 3 publications.
“…[6], [24]) has shown that in many large scale file systems, small files account for a large number of the files but only a small portion of total capacity (in some cases 99% of the files accounted for less than 10% of the capacity). So sampling randomly from the entire list of files may yield a sample set with a disproportionate number of small files that actually have very little effect on the overall capacity.…”
Section: Estimation Via Sampling - Preliminaries (mentioning)
confidence: 98%
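The point quoted above motivates capacity-weighted sampling rather than uniform sampling over the file list. The minimal Python sketch below is a generic illustration, not the sampling scheme of the cited papers; it draws files with probability proportional to their size, so the sample reflects where the capacity actually lives instead of being dominated by small files.

```python
import os
import random

def weighted_file_sample(paths, k, seed=0):
    """Sample k files with probability proportional to file size.

    Uniform sampling over the file list would be dominated by the many
    small files that contribute little to total capacity; weighting by
    size counteracts that skew. (Illustrative sketch only.)
    """
    rng = random.Random(seed)
    sizes = [os.path.getsize(p) for p in paths]
    # random.choices draws with replacement, weighted by the given sizes.
    return rng.choices(paths, weights=sizes, k=k)
```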
“…Finally, compression estimation was studied in [6] and [11]. The first paper sampled parts of each file, which results in worse performance than our method and comes without accuracy guarantees.…”
Section: Related Work (mentioning)
confidence: 99%
“…Estimation Accuracy: We compare our CLA size estimators with a systematic excerpt [17] (first 0.01n rows), which allows observing compression ratios. Table 5 reports the ARE (absolute ratio error) |Ŝ − S|/S of the estimated size Ŝ (before corrections) to the actual CLA compressed size S. CLA shows significantly better accuracy due to robustness against skew and the effects of value tuples.…”
Section: Parameter Influence and Accuracy (mentioning)
confidence: 99%
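For reference, the ARE metric quoted above is straightforward to compute. The tiny helper below is an illustrative sketch, not code from the cited paper: it applies |Ŝ − S|/S, so an estimate of 92 units against an actual compressed size of 100 units gives an ARE of 0.08.

```python
def absolute_ratio_error(estimated_size, actual_size):
    """ARE = |S_hat - S| / S, the relative deviation of an estimate
    from the actual compressed size."""
    return abs(estimated_size - actual_size) / actual_size

# Example: absolute_ratio_error(92, 100) == 0.08
```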
“…Compression Planning: The literature for compression and deduplication planning is relatively sparse and focuses on a priori estimation of compression ratios for heavyweight algorithms on generic data. A common strategy [17] is to experimentally compress a small segment of the data (excerpt) and observe the compression ratio. The drawbacks to this approach [26] are that (1) the segment may not be representative of the whole dataset and (2) the compression step can be very expensive because the runtime of many algorithms varies inversely with the achieved compression.…”
Section: Related Work (mentioning)
confidence: 99%
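The excerpt strategy described in the quote above can be sketched in a few lines. The snippet below is a hedged illustration with an assumed 1 MiB excerpt and zlib, not the exact setup of [17] or [26]: it compresses only the leading segment of a file and reports the observed ratio, which is precisely where the two stated drawbacks arise, since the excerpt may be unrepresentative and the compression step itself can be costly.

```python
import zlib

def excerpt_compression_ratio(path, excerpt_bytes=1 << 20, level=6):
    """Estimate a file's compression ratio by compressing only a
    leading excerpt (here the first 1 MiB by default) and observing
    the compressed-to-original ratio. Cheap, but the excerpt may not
    be representative of the whole file."""
    with open(path, "rb") as f:
        data = f.read(excerpt_bytes)
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)
```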
“…For example, to get the accurate size of a subset of files, one needs to deduplicate the subset of files with the help of the deduplication metadata. A quick method to estimate the deduplicated size of a file system before performing deduplication, thus without knowing the full deduplication metadata, is described by Constantinescu et al [22].…”
Section: Management Of Files On Deduplication-enabled Storage (mentioning)
confidence: 99%
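As a rough illustration of estimating deduplicated size without the full deduplication metadata, the sketch below fingerprints fixed-size chunks of a (possibly sampled) set of files and compares unique chunk bytes to total chunk bytes. This is a generic baseline for intuition only, not the specific quick estimator described by Constantinescu et al. [22].

```python
import hashlib

def estimate_dedup_ratio(paths, chunk_size=4096):
    """Rough deduplication-ratio estimate: split the given files into
    fixed-size chunks, fingerprint each chunk, and return the fraction
    of bytes that belong to chunks seen for the first time.
    (Generic illustration, not the estimator of [22].)"""
    seen = set()
    total = unique = 0
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                total += len(chunk)
                digest = hashlib.sha1(chunk).digest()
                if digest not in seen:
                    seen.add(digest)
                    unique += len(chunk)
    return unique / total if total else 1.0
```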