Efficient detection of large-scale redundancy in enterprise file systems

Forman, George; Eshghi, Kave; Suermondt, Jaap

doi:10.1145/1496909.1496926

Cited by 25 publications

(17 citation statements)

References 4 publications


“…There are many identical directories created by individual users in enterprise file systems [11], such as software packages, copies of repositories, directories of photos or music. This suggests that if two directories share one file, other files may also be shared, a common form of file locality.…”

Section: B. A Case for Exploiting File Semantics (mentioning)

confidence: 99%

SAM: A Semantic-Aware Multi-tiered Source De-duplication Framework for Cloud Backup

Tan, Jiang, Feng, et al. 2010

2010 39th International Conference on Parallel Processing


Existing de-duplication solutions in cloud backup environments either obtain high compression ratios at the cost of heavy de-duplication overheads, in the form of increased latency and reduced throughput, or keep de-duplication overheads small at the cost of low compression ratios, which raise data transmission costs and lengthen the backup window. In this paper, we present SAM, a Semantic-Aware Multi-tiered source de-duplication framework that first combines global file-level de-duplication with local chunk-level de-duplication, and then exploits file semantics in each stage of the framework, to obtain an optimal tradeoff between de-duplication efficiency and de-duplication overhead and ultimately achieve a shorter backup window than existing approaches. Our experimental results with real-world datasets show that SAM not only has a higher de-duplication efficiency/overhead ratio than existing solutions, but also shortens the backup window by an average of 38.7%.
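The two-tier idea in this abstract (global file-level de-duplication first, then local chunk-level de-duplication) can be illustrated with a minimal sketch. This is not SAM's implementation: the function names, the fixed-size chunking, the SHA-1 fingerprints, and the in-memory sets standing in for the global and local indexes are all assumptions made for illustration, and the file-semantics heuristics that SAM applies at each stage are not modeled.

```python
import hashlib


def digest(data: bytes) -> str:
    """Fingerprint a byte string (SHA-1 is an assumption, not taken from the paper)."""
    return hashlib.sha1(data).hexdigest()


def bytes_to_upload(files, global_file_index, local_chunk_index, chunk_size=4):
    """Return how many bytes would still need to be sent after both tiers.

    files: dict mapping a file name to its content (bytes).
    global_file_index: set of whole-file fingerprints already backed up.
    local_chunk_index: set of chunk fingerprints already backed up.
    Both index sets are updated in place.
    """
    sent = 0
    for data in files.values():
        file_id = digest(data)
        # Tier 1: skip the whole file if an identical file is already stored.
        if file_id in global_file_index:
            continue
        global_file_index.add(file_id)
        # Tier 2: split into fixed-size chunks and send only unseen chunks.
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            chunk_id = digest(chunk)
            if chunk_id not in local_chunk_index:
                local_chunk_index.add(chunk_id)
                sent += len(chunk)
    return sent
```

In this toy model, a second copy of a file costs nothing to back up, and a file that shares only some chunks with earlier files costs only its new chunks.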


“…• A deterministic solution that reports the exact metrics before the actual migration (i.e., we find the exact space reclamation they produce and the associated penalties, such as network costs and physical space consumption). • We find optimal datasets for space reclamation that are significantly and consistently better than alternatives, namely the strategy of migrating unique files and the strategy based on MinHash [5].…”

Section: Introduction (mentioning)

confidence: 94%

“…Also, the notion of similarity is drawn from disk sharing relationships across files and not the semantic content of the documents. We show in our evaluation that grouping files using hashing techniques such as MinimumHash [4] and approximate MinHash [5] does not work well for space reclamation.…”

Section: Related Work (mentioning)

confidence: 99%


Rangoli

Nagesh, Kathpal 2013

Proceedings of the 6th International Systems and Storage Conference on - SYSTOR '13


Space management is the activity of monitoring and ensuring adequate free space on all volumes in a clustered storage system. Volumes that exceed used-space limits are typically relieved by migrating part of their data to other underutilized volumes. Without deduplication, space reclamation is simple: one just migrates as much data as the desired space reclamation. In deduped volumes, however, there is no direct relation between the logical size of a file and the physical space it occupies. Optimal space reclamation is therefore hard because (a) migrating a few files may produce little or zero bytes of free space but still incur significant network costs, and (b) migrating a heavily shared file destroys the disk sharing relationships in that volume and increases the physical space consumption of that dataset. In this work, we have designed and built a fast and efficient tool, Rangoli, that identifies the optimal set of files for space reclamation in a deduped environment. It can scale to millions of files and terabytes of data, running in tens of minutes. We show by experimenting on real-world datasets that alternate strategies, such as those based on finding unique files or using MinHash, impact physical space consumption by a wide margin (up to 35 times) as compared to Rangoli.
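The gap between logical and physical size described in this abstract can be made concrete with a small sketch. It only illustrates the problem and is not Rangoli's algorithm: the function name `migration_metrics`, the representation of each file as a set of block identifiers, and the 4096-byte block size are assumptions for illustration.

```python
def migration_metrics(files, to_move, block_size=4096):
    """Estimate the effect of migrating a set of files off a deduped volume.

    files: dict mapping a file name to the set of block ids it references.
    to_move: set of file names chosen for migration.
    Returns (reclaimed_bytes, transferred_bytes).
    """
    # Blocks still referenced by files that stay on the source volume.
    kept_blocks = set()
    for name, blocks in files.items():
        if name not in to_move:
            kept_blocks |= blocks

    # Distinct blocks referenced by the files being migrated.
    moved_blocks = set()
    for name in to_move:
        moved_blocks |= files[name]

    # Only blocks that no remaining file references are actually freed.
    reclaimed = len(moved_blocks - kept_blocks) * block_size
    # Every distinct moved block is assumed to be sent over the network.
    transferred = len(moved_blocks) * block_size
    return reclaimed, transferred
```

With two files that share all of their blocks, moving only one of them frees nothing while still transferring every block, which is case (a) in the abstract; moving both frees the shared blocks.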


“…applications in the context of content matching for online advertising [28], detection of redundancy in enterprise file systems [13], syntactic similarity algorithms for enterprise information management [8], Web spam [35], etc. The recent development of b-bit minwise hashing [23,24] provided a substantial improvement in the estimation accuracy and speed by proposing a new estimator that stores only the lowest b bits of each hashed value.…”

Section: Introduction (mentioning)

confidence: 99%

b-bit minwise hashing in practice

Shrivastava, König 2013

Proceedings of the 5th Asia-Pacific Symposium on Internetware


Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26,32] demonstrated a potential use of b-bit minwise hashing [23,24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80 and becomes substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce learning results very similar to those obtained with fully random permutations. Experiments on datasets of up to 200GB are presented.
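The preprocessing step described in this abstract (k minimal hash values per set, with simple hash functions in place of stored permutations, and only the lowest b bits kept) can be sketched as follows. This is a single-threaded illustration, not the paper's GPU scheme. The function names, the choice of modulus, and the seeding are assumptions; the b-bit estimator uses the simplified correction that applies when the sets are small relative to the universe of possible elements, so it should be read as an approximation.

```python
import random

# Modulus for the linear hash family h(x) = (a*x + b) mod PRIME.
# A Mersenne prime is a common choice; elements must be integers in [0, PRIME).
PRIME = (1 << 61) - 1


def make_hashes(k, seed=0):
    """Draw k (a, b) pairs, one per simulated permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def signature(items, hashes):
    """Minimum hash value of a non-empty set of integers under each function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]


def estimate_jaccard(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)


def truncate(sig, b):
    """Keep only the lowest b bits of each minimum, as in b-bit minwise hashing."""
    mask = (1 << b) - 1
    return [v & mask for v in sig]


def estimate_jaccard_bbit(sig1, sig2, b):
    """Approximate Jaccard estimate from b-bit signatures.

    Unrelated minima still collide with probability about 2**-b, so the raw
    match rate is corrected for that (sparse-data approximation).
    """
    p = estimate_jaccard(sig1, sig2)
    c = 1.0 / (1 << b)
    return max(0.0, (p - c) / (1.0 - c))
```

Storing b bits instead of a full hash value per position shrinks each signature considerably, at the cost of needing a somewhat larger k for the same estimation accuracy.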
